bnxt_en driver issue on node-4.ceph
This driver issue caused Ceph to report the mon as down and produced a lot of slow ops, first on node-4's OSD and then across the whole infrastructure.
[...]
[Fri Dec 9 00:06:59 2022] bnxt_en 0000:63:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0041 address=0xf96c9004 flags=0x0000]
[Fri Dec 9 00:07:00 2022] openvswitch: ovs-system: deferred action limit reached, drop recirc action
[Fri Dec 9 00:07:05 2022] ------------[ cut here ]------------
[Fri Dec 9 00:07:05 2022] NETDEV WATCHDOG: eno36np3 (bnxt_en): transmit queue 5 timed out
[Fri Dec 9 00:07:05 2022] WARNING: CPU: 11 PID: 0 at net/sched/sch_generic.c:472 dev_watchdog+0x270/0x280
[Fri Dec 9 00:07:05 2022] Modules linked in: cpuid ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs sch_ingress geneve ip6_udp_tunnel udp_tunnel nfnetlink_cttimeout rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core xsk_diag udp_diag raw_diag unix_diag af_packet_diag tcp_diag inet_diag netlink_diag nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo aufs binfmt_misc dell_rbu overlay 8021q garp mrp stp llc bonding nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua cdc_ether input_leds usbnet mii joydev ipmi_ssif dell_smbios amd64_edac_mod dcdbas edac_mce_amd dell_wmi_descriptor wmi_bmof kvm_amd kvm ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel openvswitch nsh nf_conncount nf_nat iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mgag200
[Fri Dec 9 00:07:05 2022] drm_vram_helper i2c_algo_bit ttm dm_thin_pool drm_kms_helper dm_persistent_data syscopyarea sysfillrect dm_bio_prison hid_generic sysimgblt dm_bufio ahci fb_sys_fops crct10dif_pclmul libcrc32c crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd usbhid cryptd glue_helper tg3 drm libahci bnxt_en hid nvme i2c_piix4 nvme_core wmi
[Fri Dec 9 00:07:05 2022] CPU: 11 PID: 0 Comm: swapper/11 Not tainted 5.4.0-132-generic #148-Ubuntu
[Fri Dec 9 00:07:05 2022] Hardware name: Dell Inc. PowerEdge R6525/0GK70M, BIOS 1.4.8 05/06/2020
[Fri Dec 9 00:07:05 2022] RIP: 0010:dev_watchdog+0x270/0x280
[Fri Dec 9 00:07:05 2022] Code: eb 9d 48 8b 5d d0 c6 05 86 8d 2b 01 01 48 89 df e8 45 af fa ff 44 89 e1 48 89 de 48 c7 c7 e0 f3 63 9f 48 89 c2 e8 94 40 14 00 <0f> 0b e9 77 ff ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[Fri Dec 9 00:07:05 2022] RSP: 0018:ffffb01306c74e38 EFLAGS: 00010282
[Fri Dec 9 00:07:05 2022] RAX: 0000000000000000 RBX: ffff9f1cc30bc000 RCX: 0000000000000006
[Fri Dec 9 00:07:05 2022] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff9f1cdfd5c8c0
[Fri Dec 9 00:07:05 2022] RBP: ffffb01306c74e70 R08: 0000000000016700 R09: 0000000000000004
[Fri Dec 9 00:07:05 2022] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000005
[Fri Dec 9 00:07:05 2022] R13: ffff9f1cc30c5bc0 R14: 000000000000004a R15: ffff9f1cc30bc480
[Fri Dec 9 00:07:05 2022] FS: 0000000000000000(0000) GS:ffff9f1cdfd40000(0000) knlGS:0000000000000000
[Fri Dec 9 00:07:05 2022] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Dec 9 00:07:05 2022] CR2: 000055a803e2e1a0 CR3: 000000076c5fa000 CR4: 0000000000340ee0
[Fri Dec 9 00:07:05 2022] Call Trace:
[Fri Dec 9 00:07:05 2022] <IRQ>
[Fri Dec 9 00:07:05 2022] ? pfifo_fast_enqueue+0x150/0x150
[Fri Dec 9 00:07:05 2022] call_timer_fn+0x32/0x130
[Fri Dec 9 00:07:05 2022] __run_timers.part.0+0x180/0x280
[Fri Dec 9 00:07:05 2022] ? timerqueue_add+0x9b/0xb0
[Fri Dec 9 00:07:05 2022] ? enqueue_hrtimer+0x3d/0x90
[Fri Dec 9 00:07:05 2022] ? recalibrate_cpu_khz+0x10/0x10
[Fri Dec 9 00:07:05 2022] ? ktime_get+0x3e/0xa0
[Fri Dec 9 00:07:05 2022] ? native_apic_msr_write+0x2b/0x30
[Fri Dec 9 00:07:05 2022] run_timer_softirq+0x2a/0x50
[Fri Dec 9 00:07:05 2022] __do_softirq+0xd1/0x2c1
[Fri Dec 9 00:07:05 2022] irq_exit+0xae/0xb0
[Fri Dec 9 00:07:05 2022] smp_apic_timer_interrupt+0x7b/0x140
[Fri Dec 9 00:07:05 2022] apic_timer_interrupt+0xf/0x20
[Fri Dec 9 00:07:05 2022] </IRQ>
[Fri Dec 9 00:07:05 2022] RIP: 0010:native_safe_halt+0xe/0x10
[Fri Dec 9 00:07:05 2022] Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 36 35 51 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 26 35 51 00 fb f4 <c3> 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 fd 59 62 ff 65
[Fri Dec 9 00:07:05 2022] RSP: 0018:ffffb01300277e70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[Fri Dec 9 00:07:05 2022] RAX: ffffffff9ecf7ec0 RBX: 000000000000000b RCX: 0000000000000001
[Fri Dec 9 00:07:05 2022] RDX: 00000000a701dcca RSI: ffffb01300277e30 RDI: 0006be08378ed0e9
[Fri Dec 9 00:07:05 2022] RBP: ffffb01300277e90 R08: 0000000000000001 R09: 0000000000000001
[Fri Dec 9 00:07:05 2022] R10: 0000000000100000 R11: 0000000000000000 R12: 000000000000000b
[Fri Dec 9 00:07:05 2022] R13: ffff9f1cddf61780 R14: 0000000000000000 R15: 0000000000000000
[Fri Dec 9 00:07:05 2022] ? __cpuidle_text_start+0x8/0x8
[Fri Dec 9 00:07:05 2022] ? default_idle+0x20/0x140
[Fri Dec 9 00:07:05 2022] arch_cpu_idle+0x15/0x20
[Fri Dec 9 00:07:05 2022] default_idle_call+0x23/0x30
[Fri Dec 9 00:07:05 2022] do_idle+0x1fb/0x270
[Fri Dec 9 00:07:05 2022] cpu_startup_entry+0x20/0x30
[Fri Dec 9 00:07:05 2022] start_secondary+0x167/0x1c0
[Fri Dec 9 00:07:05 2022] secondary_startup_64+0xa4/0xb0
[Fri Dec 9 00:07:05 2022] ---[ end trace cf62f5023d86b23e ]---
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: TX timeout detected, starting reset task!
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm req_type 0x51 seq id 0xae3d error 0x2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm req_type 0x51 seq id 0xae3e error 0x2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm req_type 0x51 seq id 0xae3f error 0x2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm req_type 0x51 seq id 0xae40 error 0x2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm req_type 0x51 seq id 0xae41 error 0x2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm req_type 0x51 seq id 0xae42 error 0x2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm req_type 0x51 seq id 0xae43 error 0x2
[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2
[...]
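
To spot this failure pattern earlier on other nodes, the kernel log can be scanned for the same signatures. The following is a minimal Python sketch, not an established tool: the pattern names, the `summarize` and `decode_rc` helpers, and the `SAMPLE` list are all hypothetical, and the regular expressions are derived only from the messages in the excerpt above. Reading `rc:ffffffea` as a negative errno is also an assumption about how the driver prints its return code.

```python
import errno
import re

# Signatures derived from the log excerpt above; the key names are illustrative.
PATTERNS = {
    "io_page_fault": re.compile(r"bnxt_en (\S+): AMD-Vi: Event logged \[IO_PAGE_FAULT"),
    "tx_timeout": re.compile(r"NETDEV WATCHDOG: (\S+) \(bnxt_en\): transmit queue (\d+) timed out"),
    "reset_started": re.compile(r"bnxt_en \S+ (\S+): TX timeout detected, starting reset task"),
    "ring_free_failed": re.compile(r"bnxt_en \S+ (\S+): hwrm_ring_free type \d+ failed"),
}


def summarize(lines):
    """Count how many log lines match each signature."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def decode_rc(hex_rc):
    """Interpret a 32-bit hex value as a signed integer and look up the errno name.

    Assumption: the driver prints a negative errno with a hex format specifier.
    """
    value = int(hex_rc, 16)
    if value >= 0x80000000:
        value -= 0x100000000
    return value, errno.errorcode.get(-value, "unknown")


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "[Fri Dec 9 00:06:59 2022] bnxt_en 0000:63:00.3: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0041 address=0xf96c9004 flags=0x0000]",
    "[Fri Dec 9 00:07:05 2022] NETDEV WATCHDOG: eno36np3 (bnxt_en): transmit queue 5 timed out",
    "[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: TX timeout detected, starting reset task!",
    "[Fri Dec 9 00:07:05 2022] bnxt_en 0000:63:00.3 eno36np3: hwrm_ring_free type 1 failed. rc:ffffffea err:2",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
    print(decode_rc("ffffffea"))
```

On the sample lines each signature matches once, and `decode_rc("ffffffea")` yields -22, which corresponds to EINVAL if the value is indeed a negative errno. In practice the same function could be fed the output of `dmesg -T` line by line.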