Project

General

Profile

Actions

tickets #161774

open

Flapping HAProxy health checks

Added by crameleon 26 days ago. Updated 12 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Servers hosted in PRG
Target version:
-
Start date:
2024-06-04
Due date:
% Done:

0%

Estimated time:

Description

Since earlier today, the health checks towards various backends on all our HAProxy servers are flapping. This is reported by monitoring and follows errors logged by HAProxy - here is just one example, from two backends on hel1:

Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server kanidm/kani1 failed, reason: Layer6 timeout, check duration: 2001ms, status: 2/3 UP.
Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server netbox/netbox1 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP

The errors seem to vary between "Layer{4,5,6,7} timeout", making them not very precise.

On the relevant backend servers, no connection attempt is observed, however this was only validated with our LDAP backend, as our HTTP backends are configured to discard health check logging (TODO: tcpdump).

  • it happens both on public (atlas{1,2}) and on internal (hel{1,2}) proxies
    => likely no overload caused from the outside.

  • it happens across all kinds of protocols (HTTP, HTTPS, LDAP, MySQL) and backend services (nginx, httpd, Kanidm, MariaDB).
    => likely not a backend software problem

  • no network changes were implemented.

  • the failure always happens simultaneously (in the same second) on both proxies in a given HA pair.

  • the timeline suggests it having started to happen after the installation of updates last night
    => on the proxy server and the firewall, the installed updates were: glibc, glibc-locale, glibc-locale base
    => the only stale files reported by zypper ps were dbus related -> performed reboot of a proxy server, did not help
    => rolled back glibc on a proxy server (from 2.31-150300.83.1 to to 2.31-150300.74.1), did not help

  • simulating the HTTP health checks with curl works fine - even in a while true loop

  • ping does not show any packet loss

Example graph showing the heavy flapping on hel2 starting today:

https://monitor.opensuse.org/grafana/d/rEqu1u5ue/haproxy?orgId=1&refresh=1m&from=1717420935609&to=1717507335609&var-DS_PROMETHEUS=default&var-host=hel2.infra.opensuse.org&var-backend=All&var-frontend=All&var-server=All&var-code=All&var-interval=30s&viewPanel=185

Actions

Also available in: Atom PDF