tickets #161774
Flapping HAProxy health checks
Status: open
Description
Since earlier today, health checks towards various backends on all our HAProxy servers have been flapping. This is reported by monitoring and matches errors logged by HAProxy - here is one example each from two backends on hel1:
Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING] (9051) : Health check for server kanidm/kani1 failed, reason: Layer6 timeout, check duration: 2001ms, status: 2/3 UP.
Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING] (9051) : Health check for server netbox/netbox1 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP
The errors seem to vary between "Layer{4,5,6,7} timeout", making them not very precise.
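For context, these are plain HAProxy server health checks. Roughly the shape below - a minimal sketch, not our actual configuration; address, port, and intervals are placeholders, and the 2s check timeout and the "fall 3" are only guesses based on the "check duration: 2001ms" and "status: 2/3 UP" fields above:

backend kanidm
    # "Layer4 timeout" = TCP connect timed out, "Layer6 timeout" = TLS handshake timed out
    timeout check 2s
    server kani1 [2001:db8::10]:636 check check-ssl verify none inter 5s fall 3 rise 2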
On the relevant backend servers, no connection attempt is observed; however, this was only validated with our LDAP backend, as our HTTP backends are configured to discard health check logging (TODO: tcpdump).
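For the TODO, a capture along these lines on a backend should settle whether the check packets arrive at all (interface, port, and proxy address are placeholders, not our real values):

backend:~ # tcpdump -ni eth0 'tcp port 443 and host 2001:db8::1'
# every health check should appear as a short connection attempt from the proxy;
# if nothing shows up while HAProxy logs a failure, the packets never arrive here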
- it happens both on public (atlas{1,2}) and on internal (hel{1,2}) proxies
  => likely no overload caused from the outside.
- it happens across all kinds of protocols (HTTP, HTTPS, LDAP, MySQL) and backend services (nginx, httpd, Kanidm, MariaDB)
  => likely not a backend software problem.
- no network changes were implemented.
- the failure always happens simultaneously (in the same second) on both proxies in a given HA pair.
- the timeline suggests it started after the installation of updates last night
  => on the proxy server and the firewall, the installed updates were: glibc, glibc-locale, glibc-locale-base
  => the only stale files reported by zypper ps were dbus related -> performed a reboot of a proxy server, did not help
  => rolled back glibc on a proxy server (from 2.31-150300.83.1 to 2.31-150300.74.1), did not help
- simulating the HTTP health checks with curl works fine - even in a while true loop (see the sketch after this list)
- ping does not show any packet loss
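The curl simulation was of this kind - a rough sketch, with the URL being a placeholder for the respective backend:

while true; do
    # print HTTP status code and total request time for each probe
    curl -sk -o /dev/null -w '%{http_code} %{time_total}s\n' https://backend.example.org/
    sleep 1
done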
Example graph showing the heavy flapping on hel2 starting today:
Updated by crameleon 8 months ago
tcpdump on a backend, piped into Wireshark, shows something like the following around the second in which HAProxy reports the health check failure:
6580 19:23:55.134169 2a07:de40:b27e:1211::a 2a07:de40:b27e:1203::11 TCP 94 [TCP Retransmission] 443 → 54390 [SYN, ACK] Seq=0 Ack=1 Win=64260 Len=0 MSS=1440 SACK_PERM TSval=2834215675 TSecr=1741091386 WS=128
6581 19:23:55.135931 2a07:de40:b27e:1203::11 2a07:de40:b27e:1211::a TCP 74 54390 → 443 [RST] Seq=1 Win=0 Len=0
From what I read online, the RST is expected; it is what HAProxy uses to quickly tear down health check connections.
The TCP retransmission, however, does not seem quite right.
On one occasion there was a duplicate ACK as well, but that does not seem to happen as often.
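For anyone reproducing this: Wireshark flags both anomalies via the display filter fields tcp.analysis.retransmission and tcp.analysis.duplicate_ack, so the capture and the filtering can also be done non-interactively (interface and port are placeholders):

backend:~ # tcpdump -ni eth0 -w /tmp/healthcheck.pcap 'tcp port 443'
backend:~ # tshark -r /tmp/healthcheck.pcap -Y 'tcp.analysis.retransmission or tcp.analysis.duplicate_ack'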
Updated by crameleon 8 months ago
On the hypervisor running the backend and proxy server I am troubleshooting with, as well as on the hypervisor hosting the firewall, no issues with the network bonding are noticeable:
falkor22 (Hypervisor):~ # grep -E '^(MII S|Link)' /proc/net/bonding/bond-fib
MII Status: up
MII Status: up
Link Failure Count: 0
MII Status: up
Link Failure Count: 0
orbit20 (Hypervisor):~ # grep -E '^(MII S|Link)' /proc/net/bonding/bond-fib
MII Status: up
MII Status: up
Link Failure Count: 0
MII Status: up
Link Failure Count: 0
Updated by crameleon 8 months ago
On the hypervisor running the firewall, some updates were installed as well:
orbit20 (Hypervisor):~ # grep ^2024-06-04 /var/log/zypp/history
2024-06-04 01:05:35|command|root@orbit20|'zypper' 'patch' '-y' '--auto-agree-with-product-licenses' '--category' 'security'|
2024-06-04 01:05:35|install|glibc|2.31-150300.83.1|x86_64||repo-sle-update|1278b347a1da5f21c19740d4f09cce456252985ae7f6f94302a4d12589f22ece|
2024-06-04 01:05:36|install|glibc-locale-base|2.31-150300.83.1|x86_64||repo-sle-update|782c994c42c27b7c1e8b1cf68ab61ea6a080d1e829594748789d9367fb5fbcc3|
2024-06-04 01:05:39|install|glibc-locale|2.31-150300.83.1|x86_64||repo-sle-update|a5bd9f6d2f87df5ab2eb3adef917fcd26f99eac3079c60e648fb7e2c83729353|
2024-06-04 01:05:39|patch |openSUSE-SLE-15.5-2024-1895|1|noarch|repo-sle-update|important|security|needed|applied|
orbit20 (Hypervisor):~ # zypper ps -s
The following running processes use deleted files:
PID | PPID | UID | User | Command | Service
------+------+-----+------------+--------------------+----------
1332 | 1 | 499 | messagebus | dbus-daemon | dbus
20866 | 1 | 0 | root | virtlogd | virtlogd
21093 | 1 | 107 | qemu | qemu-system-x86_64 |
21096 | 1 | 0 | root | virtlockd | virtlockd
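Side note: zypper ps -sss condenses this to just the names of the services which still use deleted files, which is handy for deciding what to restart:

orbit20 (Hypervisor):~ # zypper ps -sss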
Updated by crameleon 8 months ago
- Status changed from In Progress to Resolved
That seems to have done it, but two backends (discourse01 and netbox1) needed the same.
The graphs show mostly flat RetransSegs curves again (minimal spikes on proxies, none on firewalls and backends).
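Assuming the panels are fed by the standard TcpRetransSegs counter from /proc/net/snmp, the value can also be spot-checked directly on a host:

proxy:~ # nstat -az TcpRetransSegs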
Alerts are unmuted again.
It would have been great to understand what the real cause was (glibc update, dbus, something in the host kernel, something in the guest kernel, QEMU, ...) but I suppose I'll never find out.
Certainly one of the more peculiar issues.
Updated by crameleon 3 months ago
The behavior is still present; I submitted https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2115 to lower the alerting level until this is solved. Once it is, the patch should be reverted.