Project

General

Profile

Actions

tickets #161774

open

Flapping HAProxy health checks

Added by crameleon 8 months ago. Updated 3 months ago.

Status:
Workable
Priority:
Normal
Assignee:
Category:
Servers hosted in PRG
Target version:
-
Start date:
2024-06-04
Due date:
% Done:

0%

Estimated time:

Description

Since earlier today, the health checks towards various backends on all our HAProxy servers are flapping. This is reported by monitoring and follows errors logged by HAProxy - here is just one example, from two backends on hel1:

Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server kanidm/kani1 failed, reason: Layer6 timeout, check duration: 2001ms, status: 2/3 UP.
Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server netbox/netbox1 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP

The errors seem to vary between "Layer{4,5,6,7} timeout", making them not very precise.

On the relevant backend servers, no connection attempt is observed, however this was only validated with our LDAP backend, as our HTTP backends are configured to discard health check logging (TODO: tcpdump).

  • it happens both on public (atlas{1,2}) and on internal (hel{1,2}) proxies
    => likely no overload caused from the outside.

  • it happens across all kinds of protocols (HTTP, HTTPS, LDAP, MySQL) and backend services (nginx, httpd, Kanidm, MariaDB).
    => likely not a backend software problem

  • no network changes were implemented.

  • the failure always happens simultaneously (in the same second) on both proxies in a given HA pair.

  • the timeline suggests it having started to happen after the installation of updates last night
    => on the proxy server and the firewall, the installed updates were: glibc, glibc-locale, glibc-locale base
    => the only stale files reported by zypper ps were dbus related -> performed reboot of a proxy server, did not help
    => rolled back glibc on a proxy server (from 2.31-150300.83.1 to to 2.31-150300.74.1), did not help

  • simulating the HTTP health checks with curl works fine - even in a while true loop

  • ping does not show any packet loss

Example graph showing the heavy flapping on hel2 starting today:

https://monitor.opensuse.org/grafana/d/rEqu1u5ue/haproxy?orgId=1&refresh=1m&from=1717420935609&to=1717507335609&var-DS_PROMETHEUS=default&var-host=hel2.infra.opensuse.org&var-backend=All&var-frontend=All&var-server=All&var-code=All&var-interval=30s&viewPanel=185

Actions #1

Updated by crameleon 8 months ago

  • Private changed from Yes to No
Actions #2

Updated by crameleon 8 months ago

  • Description updated (diff)
Actions #3

Updated by crameleon 8 months ago · Edited

For debugging, I am using the hel2.i.o.o proxy server (currently the internal backup and also where I attempted the glibc rollback) and the netbox1.i.o.o backend (very low traffic and not yet used for production).

Actions #4

Updated by crameleon 8 months ago

tcpdump on a backend piped into Wireshark shows something like the following around the second HAProxy reports the health check failure:

6580    19:23:55.134169 2a07:de40:b27e:1211::a  2a07:de40:b27e:1203::11 TCP 94  [TCP Retransmission] 443 → 54390 [SYN, ACK] Seq=0 Ack=1 Win=64260 Len=0 MSS=1440 SACK_PERM TSval=2834215675 TSecr=1741091386 WS=128
6581    19:23:55.135931 2a07:de40:b27e:1203::11 2a07:de40:b27e:1211::a  TCP 74  54390 → 443 [RST] Seq=1 Win=0 Len=0

From what I read online, the RST is expected, it is what HAProxy uses to quickly end health checks.
The TCP Retransmission however does not seem quite right.

At one occurrence there was a duplicate ACK as well, but that does not seem to happen as often.

Actions #5

Updated by crameleon 8 months ago

With this information, I also found something I did not spot in the monitoring before - the TCP Retransmission rate on the backend is visibly higher since when the problems started:

https://monitor.opensuse.org/grafana/d/rYdddlPWk/node-metrics?orgId=1&var-DS_PROMETHEUS=default&var-job=nodes&var-node=netbox1.infra.opensuse.org&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=1717349583894&to=1717522383894&viewPanel=104

(the orange part)

Actions #8

Updated by crameleon 8 months ago

On the hypervisor running the backend and proxy server I am troubleshooting with, as well as on the hypervisor hosting the firewall, no issues with the network bonding are noticeable:

falkor22 (Hypervisor):~ # grep -E '^(MII S|Link)' /proc/net/bonding/bond-fib
MII Status: up
MII Status: up
Link Failure Count: 0
MII Status: up
Link Failure Count: 0
orbit20 (Hypervisor):~ # grep -E '^(MII S|Link)' /proc/net/bonding/bond-fib
MII Status: up
MII Status: up
Link Failure Count: 0
MII Status: up
Link Failure Count: 0
Actions #9

Updated by crameleon 8 months ago

On the hypervisor running the firewall, some updates were installed as well:

orbit20 (Hypervisor):~ # grep ^2024-06-04 /var/log/zypp/history
2024-06-04 01:05:35|command|root@orbit20|'zypper' 'patch' '-y' '--auto-agree-with-product-licenses' '--category' 'security'|
2024-06-04 01:05:35|install|glibc|2.31-150300.83.1|x86_64||repo-sle-update|1278b347a1da5f21c19740d4f09cce456252985ae7f6f94302a4d12589f22ece|
2024-06-04 01:05:36|install|glibc-locale-base|2.31-150300.83.1|x86_64||repo-sle-update|782c994c42c27b7c1e8b1cf68ab61ea6a080d1e829594748789d9367fb5fbcc3|
2024-06-04 01:05:39|install|glibc-locale|2.31-150300.83.1|x86_64||repo-sle-update|a5bd9f6d2f87df5ab2eb3adef917fcd26f99eac3079c60e648fb7e2c83729353|
2024-06-04 01:05:39|patch  |openSUSE-SLE-15.5-2024-1895|1|noarch|repo-sle-update|important|security|needed|applied|

orbit20 (Hypervisor):~ # zypper ps -s
The following running processes use deleted files:

PID   | PPID | UID | User       | Command            | Service
------+------+-----+------------+--------------------+----------
1332  | 1    | 499 | messagebus | dbus-daemon        | dbus
20866 | 1    | 0   | root       | virtlogd           | virtlogd
21093 | 1    | 107 | qemu       | qemu-system-x86_64 |
21096 | 1    | 0   | root       | virtlockd          | virtlockd
Actions #10

Updated by crameleon 8 months ago

Did full updates + reboot on firewalls and their hypervisors now (usually I would do this every other weekend, but maybe it helps here).

Actions #11

Updated by crameleon 8 months ago

  • Status changed from New to In Progress
  • Assignee set to crameleon
Actions #12

Updated by crameleon 8 months ago

On a first glance (~20 minutes after the update/reboot dance), this seems to have done the trick. Monitoring shows no retransmissions since and neither did the health checks flap. But going to continue actively monitoring for a while ...

Actions #13

Updated by crameleon 8 months ago

  • Status changed from In Progress to Resolved

That seems to have done it, but two backends (discourse01 and netbox1) needed the same.

The graphs show mostly flat RetransSegs curves again (minimal spikes on proxies, none on firewalls and backends).

Alerts are unmuted again.

It would have been great to understand what the real cause was (glibc update, dbus, something in the host kernel, something in the guest kernel, QEMU, ...) but I suppose I'll never find out.

Certainly one of the more peculiar issues.

Actions #14

Updated by crameleon 7 months ago

  • Status changed from Resolved to New

It is happening again today ...

Actions #16

Updated by crameleon 7 months ago · Edited

Curiously, this time no updates were installed in the last 1-2 days on any of the involved machines (proxy servers, backends, hypervisors - only the firewalls and their hypervisors received their 15.6 upgrade yesterday, but including a full reboot).

Actions #17

Updated by crameleon 7 months ago

  • Status changed from New to In Progress

Installed perf tools and run a tmux with tcpretrans -s on hel2 and netbox1.

Actions #18

Updated by crameleon 5 months ago

  • Status changed from In Progress to Workable

Lately not observed, but needs investigation if the retrans behavior is actually gone.

Actions #19

Updated by crameleon 3 months ago

Behavior is still present, submitted https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2115 to lower the alerting level until this is solved. Once it is, the patch should be reverted.

Actions

Also available in: Atom PDF