tickets #161774: Flapping HAProxy health checks - openSUSE admin - openSUSE Project Management Tool

open

Flapping HAProxy health checks

Added by crameleon 8 months ago. Updated 3 months ago.

Status: Workable
Priority: Normal
Assignee: crameleon
Category: Servers hosted in PRG
Target version:
Start date: 2024-06-04
Due date:
% Done:
Estimated time:

Description

Since earlier today, the health checks towards various backends on all our HAProxy servers have been flapping. Monitoring reports this, and HAProxy logs matching errors. Here is one example, from two backends on hel1:

Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server kanidm/kani1 failed, reason: Layer6 timeout, check duration: 2001ms, status: 2/3 UP.
Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server netbox/netbox1 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP
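To see how the failures spread across backends and reported layers, the log lines can be tallied. The following is a minimal sketch that assumes the log format matches the two sample lines above; the regular expression and the function names (`parse_check_failure`, `tally_reasons`) are illustrative and not part of HAProxy or of our tooling.

```python
import re
from collections import Counter

# Pattern derived only from the two sample lines above (assumption);
# other HAProxy versions or log formats may need adjustments.
CHECK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) haproxy\[\d+\]: "
    r".*Health check for server (?P<backend>[^/\s]+)/(?P<server>\S+) failed, "
    r"reason: (?P<reason>[^,]+), check duration: (?P<duration_ms>\d+)ms, "
    r"status: (?P<status>\d+/\d+ \w+)"
)


def parse_check_failure(line):
    """Return a dict of fields for a failed-check line, or None if it does not match."""
    match = CHECK_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["duration_ms"] = int(fields["duration_ms"])
    return fields


def tally_reasons(lines):
    """Count failed checks per (backend, reason) pair."""
    counts = Counter()
    for line in lines:
        record = parse_check_failure(line)
        if record is not None:
            counts[(record["backend"], record["reason"])] += 1
    return counts


# The two log lines quoted in this ticket.
SAMPLE = [
    "Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server kanidm/kani1 failed, reason: Layer6 timeout, check duration: 2001ms, status: 2/3 UP.",
    "Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]  (9051) : Health check for server netbox/netbox1 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP",
]

if __name__ == "__main__":
    for key, count in tally_reasons(SAMPLE).items():
        print(key, count)
```

Since the parsed records also carry the `ts` and `host` fields, the same output could be grouped by timestamp to compare the two proxies of an HA pair.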

The reported reason varies between "Layer{4,5,6,7} timeout", so it does not pinpoint the layer at which the checks stall.

On the relevant backend servers, no connection attempt is observed. However, this was only validated with our LDAP backend, as our HTTP backends are configured to discard health check logging (TODO: tcpdump).

- It happens on both public (atlas{1,2}) and internal (hel{1,2}) proxies.
  => Likely not overload caused from the outside.
- It happens across all kinds of protocols (HTTP, HTTPS, LDAP, MySQL) and backend services (nginx, httpd, Kanidm, MariaDB).
  => Likely not a backend software problem.
- No network changes were implemented.
- The failure always happens simultaneously (in the same second) on both proxies in a given HA pair.
- The timeline suggests it started after the installation of updates last night.
  => On the proxy server and the firewall, the installed updates were: glibc, glibc-locale, glibc-locale-base.
  => The only stale files reported by zypper ps were dbus related -> performed a reboot of a proxy server, did not help.
  => Rolled back glibc on a proxy server (from 2.31-150300.83.1 to 2.31-150300.74.1), did not help.
- Simulating the HTTP health checks with curl works fine, even in a while true loop.
- ping does not show any packet loss.
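Because the curl loop only exercises HTTP, a protocol-independent probe might help narrow down the Layer4 timeouts. The following is a minimal sketch of a timed TCP connect loop. The 2-second default timeout is an assumption taken from the roughly 2000 ms check durations in the log excerpt, and `tcp_check` and `probe_loop` are hypothetical helper names, not existing tooling.

```python
import socket
import time


def tcp_check(host, port, timeout=2.0):
    """Attempt one TCP connect; return (ok, elapsed_ms, error_text)."""
    start = time.monotonic()
    try:
        # Open and immediately close the connection, similar in spirit
        # to a plain TCP health check.
        with socket.create_connection((host, port), timeout=timeout):
            ok, err = True, ""
    except OSError as exc:  # covers timeouts and refused connections
        ok, err = False, f"{type(exc).__name__}: {exc}"
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return ok, elapsed_ms, err


def probe_loop(host, port, count, interval=1.0, timeout=2.0):
    """Run `count` probes and return only the failed attempts."""
    failures = []
    for i in range(count):
        ok, elapsed_ms, err = tcp_check(host, port, timeout)
        if not ok:
            failures.append((time.strftime("%H:%M:%S"), round(elapsed_ms), err))
        if i + 1 < count:
            time.sleep(interval)
    return failures
```

Running something like `probe_loop("backend.example", 443, count=600)` (placeholder host and port) on a proxy would give timestamps of failed connects that can be compared against the HAProxy log entries.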

Example graph showing the heavy flapping on hel2 starting today:

https://monitor.opensuse.org/grafana/d/rEqu1u5ue/haproxy?orgId=1&refresh=1m&from=1717420935609&to=1717507335609&var-DS_PROMETHEUS=default&var-host=hel2.infra.opensuse.org&var-backend=All&var-frontend=All&var-server=All&var-code=All&var-interval=30s&viewPanel=185
