tickets #161774

Flapping HAProxy health checks

Added by crameleon 26 days ago. Updated 12 days ago.

Status: In Progress
Priority: Normal
Assignee: crameleon
Category: Servers hosted in PRG
Target version: -
Start date: 2024-06-04
Due date:
% Done: 0%
Estimated time:

Description

Since earlier today, the health checks towards various backends on all our HAProxy servers have been flapping. This is reported by monitoring and matches errors logged by HAProxy - here is just one example, from two backends on hel1:

Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server kanidm/kani1 failed, reason: Layer6 timeout, check duration: 2001ms, status: 2/3 UP.
Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server netbox/netbox1 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP

The errors seem to vary between "Layer{4,5,6,7} timeout", making them not very precise.

On the relevant backend servers, no connection attempt is observed; however, this was only validated with our LDAP backend, as our HTTP backends are configured to discard health check logging (TODO: tcpdump).
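
Something along these lines should confirm whether the health check connections arrive at the backend at all (interface and port are placeholders and need adjusting per backend):

# run on the backend; -i any captures on all interfaces, port 443 is just an example
tcpdump -ni any 'tcp port 443'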

  • it happens both on public (atlas{1,2}) and on internal (hel{1,2}) proxies
    => likely not caused by overload from the outside.

  • it happens across all kinds of protocols (HTTP, HTTPS, LDAP, MySQL) and backend services (nginx, httpd, Kanidm, MariaDB).
    => likely not a backend software problem

  • no network changes were implemented.

  • the failure always happens simultaneously (in the same second) on both proxies in a given HA pair.

  • the timeline suggests it started after the installation of updates last night
    => on the proxy server and the firewall, the installed updates were: glibc, glibc-locale, glibc-locale-base
    => the only stale files reported by zypper ps were dbus related -> performed a reboot of a proxy server, did not help
    => rolled back glibc on a proxy server (from 2.31-150300.83.1 to 2.31-150300.74.1), did not help

  • simulating the HTTP health checks with curl works fine - even in a while-true loop (see the sketch after this list)

  • ping does not show any packet loss
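
For reference, the curl loop was roughly the following (URL and interval are approximations of the real health check):

# hypothetical approximation of the HTTP health check; URL and sleep interval are assumptions
while true; do
    curl -sk -o /dev/null -w '%{http_code} %{time_total}s\n' https://netbox1.infra.opensuse.org/
    sleep 1
done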

Example graph showing the heavy flapping on hel2 starting today:

https://monitor.opensuse.org/grafana/d/rEqu1u5ue/haproxy?orgId=1&refresh=1m&from=1717420935609&to=1717507335609&var-DS_PROMETHEUS=default&var-host=hel2.infra.opensuse.org&var-backend=All&var-frontend=All&var-server=All&var-code=All&var-interval=30s&viewPanel=185

Actions #1

Updated by crameleon 26 days ago

  • Private changed from Yes to No
Actions #2

Updated by crameleon 26 days ago

  • Description updated (diff)
Actions #3

Updated by crameleon 26 days ago · Edited

For debugging, I am using the hel2.i.o.o proxy server (currently the internal backup and also where I attempted the glibc rollback) and the netbox1.i.o.o backend (very low traffic and not yet used for production).

Actions #4

Updated by crameleon 26 days ago

tcpdump on a backend piped into Wireshark shows something like the following around the second in which HAProxy reports the health check failure:

6580    19:23:55.134169 2a07:de40:b27e:1211::a  2a07:de40:b27e:1203::11 TCP 94  [TCP Retransmission] 443 → 54390 [SYN, ACK] Seq=0 Ack=1 Win=64260 Len=0 MSS=1440 SACK_PERM TSval=2834215675 TSecr=1741091386 WS=128
6581    19:23:55.135931 2a07:de40:b27e:1203::11 2a07:de40:b27e:1211::a  TCP 74  54390 → 443 [RST] Seq=1 Win=0 Len=0

From what I read online, the RST is expected; it is what HAProxy uses to quickly end health check connections.
The TCP retransmission, however, does not seem quite right.

On one occasion there was a duplicate ACK as well, but that does not seem to happen as often.
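
For reference, the capture pipeline behind the above was roughly the following (interface and capture filter are simplified here):

# stream the capture from the backend into a local Wireshark instance
ssh netbox1.infra.opensuse.org 'tcpdump -ni any -U -w - tcp port 443' | wireshark -k -i -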

Actions #5

Updated by crameleon 26 days ago

With this information, I also found something I did not spot in the monitoring before - the TCP retransmission rate on the backend has been visibly higher since the problems started:

https://monitor.opensuse.org/grafana/d/rYdddlPWk/node-metrics?orgId=1&var-DS_PROMETHEUS=default&var-job=nodes&var-node=netbox1.infra.opensuse.org&var-diskdevices=%5Ba-z%5D%2B%7Cnvme%5B0-9%5D%2Bn%5B0-9%5D%2B%7Cmmcblk%5B0-9%5D%2B&from=1717349583894&to=1717522383894&viewPanel=104

(the orange part)
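
This can also be confirmed directly on the backend without Grafana by watching the raw counter behind that panel (nstat is part of iproute2):

# cumulative TCP retransmission counter; a fast-climbing value matches the orange part of the graph
watch -n 5 'nstat -az TcpRetransSegs'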

Actions #8

Updated by crameleon 26 days ago

On the hypervisor running the backend and proxy server I am troubleshooting with, as well as on the hypervisor hosting the firewall, no issues with the network bonding are noticeable:

falkor22 (Hypervisor):~ # grep -E '^(MII S|Link)' /proc/net/bonding/bond-fib
MII Status: up
MII Status: up
Link Failure Count: 0
MII Status: up
Link Failure Count: 0
orbit20 (Hypervisor):~ # grep -E '^(MII S|Link)' /proc/net/bonding/bond-fib
MII Status: up
MII Status: up
Link Failure Count: 0
MII Status: up
Link Failure Count: 0
Actions #9

Updated by crameleon 26 days ago

On the hypervisor running the firewall, some updates were installed as well:

orbit20 (Hypervisor):~ # grep ^2024-06-04 /var/log/zypp/history
2024-06-04 01:05:35|command|root@orbit20|'zypper' 'patch' '-y' '--auto-agree-with-product-licenses' '--category' 'security'|
2024-06-04 01:05:35|install|glibc|2.31-150300.83.1|x86_64||repo-sle-update|1278b347a1da5f21c19740d4f09cce456252985ae7f6f94302a4d12589f22ece|
2024-06-04 01:05:36|install|glibc-locale-base|2.31-150300.83.1|x86_64||repo-sle-update|782c994c42c27b7c1e8b1cf68ab61ea6a080d1e829594748789d9367fb5fbcc3|
2024-06-04 01:05:39|install|glibc-locale|2.31-150300.83.1|x86_64||repo-sle-update|a5bd9f6d2f87df5ab2eb3adef917fcd26f99eac3079c60e648fb7e2c83729353|
2024-06-04 01:05:39|patch  |openSUSE-SLE-15.5-2024-1895|1|noarch|repo-sle-update|important|security|needed|applied|

orbit20 (Hypervisor):~ # zypper ps -s
The following running processes use deleted files:

PID   | PPID | UID | User       | Command            | Service
------+------+-----+------------+--------------------+----------
1332  | 1    | 499 | messagebus | dbus-daemon        | dbus
20866 | 1    | 0   | root       | virtlogd           | virtlogd
21093 | 1    | 107 | qemu       | qemu-system-x86_64 |
21096 | 1    | 0   | root       | virtlockd          | virtlockd
Actions #10

Updated by crameleon 26 days ago

Did full updates + reboot on firewalls and their hypervisors now (usually I would do this every other weekend, but maybe it helps here).
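
Per machine this boils down to roughly:

# apply all pending patches, then reboot
# (same flags as the automatic security patching seen in the zypper history, just without the category restriction)
zypper refresh && zypper patch -y --auto-agree-with-product-licenses && reboot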

Actions #11

Updated by crameleon 26 days ago

  • Status changed from New to In Progress
  • Assignee set to crameleon
Actions #12

Updated by crameleon 26 days ago

At first glance (~20 minutes after the update/reboot dance), this seems to have done the trick. Monitoring shows no retransmissions since, and the health checks have not flapped either. But I am going to continue actively monitoring for a while ...

Actions #13

Updated by crameleon 25 days ago

  • Status changed from In Progress to Resolved

That seems to have done it, but two backends (discourse01 and netbox1) needed the same treatment.

The graphs show mostly flat RetransSegs curves again (minimal spikes on proxies, none on firewalls and backends).

Alerts are unmuted again.

It would have been great to understand what the real cause was (glibc update, dbus, something in the host kernel, something in the guest kernel, QEMU, ...) but I suppose I'll never find out.

Certainly one of the more peculiar issues.

Actions #14

Updated by crameleon 12 days ago

  • Status changed from Resolved to New

It is happening again today ...

Actions #16

Updated by crameleon 12 days ago · Edited

Curiously, this time no updates were installed in the last 1-2 days on any of the involved machines (proxy servers, backends, hypervisors) - only the firewalls and their hypervisors received their 15.6 upgrade yesterday, which included a full reboot.

Actions #17

Updated by crameleon 12 days ago

  • Status changed from New to In Progress

Installed perf tools and started a tmux session running tcpretrans -s on hel2 and netbox1.
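
Roughly, on each of the two hosts (session name is arbitrary):

# keep the retransmission tracer running in a detached tmux session for later review
tmux new-session -d -s retrans 'tcpretrans -s'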
