Project

General

Profile

tickets #161774

Updated by crameleon 26 days ago

Since earlier today, the health checks towards various backends on all our HAProxy servers are flapping. This is reported by monitoring and follows errors logged by HAProxy - here is just one example, from two backends on hel1: 

 ``` 
 Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]    (9051) : Health check for server kanidm/kani1 failed, reason: Layer6 timeout, check duration: 2001ms, status: 2/3 UP. 
 Jun 04 16:45:45 hel1 haproxy[9051]: [WARNING]    (9051) : Health check for server netbox/netbox1 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP 
 ``` 

 The errors seem to vary between "Layer{4,5,6,7} timeout", making them not very precise. 

 On the relevant backend servers, no connection attempt is observed, however this was only validated with our LDAP backend, as our HTTP backends are configured to discard health check logging (TODO: tcpdump). 

 - it happens both on public (atlas{1,2}) and on internal (hel{1,2}) proxies 
   => likely no overload caused from the outside. 

 
 - it happens across all kinds of protocols (HTTP, HTTPS, LDAP, MySQL) LDAP) and backend services (nginx, httpd, Kanidm, MariaDB). MariaDB, Kanidm) 
   => likely not a backend software problem 

 
 - no network changes were implemented. 

 
 - the failure always happens simultaneously (in the same second) on both proxies in a given HA pair. 

 
 - the timeline suggests it having started to happen after the installation of updates last night 
   => on the proxy server and the firewall, the installed updates were: glibc, glibc-locale, glibc-locale base 
   => the only stale files reported by `zypper ps` were dbus related -> performed reboot of a proxy server, did not help 
   => rolled back glibc on a proxy server (from 2.31-150300.83.1 to to 2.31-150300.74.1), did not help 

 
 - simulating the HTTP health checks with `curl` works fine - even in a `while true` loop 

 
 - ping does not show any packet loss 

 Example graph showing the heavy flapping on hel2 starting today: 

 https://monitor.opensuse.org/grafana/d/rEqu1u5ue/haproxy?orgId=1&refresh=1m&from=1717420935609&to=1717507335609&var-DS_PROMETHEUS=default&var-host=hel2.infra.opensuse.org&var-backend=All&var-frontend=All&var-server=All&var-code=All&var-interval=30s&viewPanel=185

Back