action #158059
closed
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #108209: [epic] Reduce load on OSD
OSD unresponsive or significantly slow for some minutes 2024-03-26 13:34Z
0%
Description
Observation
Another instance of unresponsiveness on 2024-03-26 13:34Z, see https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1711459478659&to=1711461053083 . The unresponsiveness was sudden, with indicators beforehand. CPU usage and other parameters were moderate and dropped after the unresponsiveness started. The first obvious symptom in the system journal on OSD:
Mar 26 14:34:50 openqa telegraf[27132]: 2024-03-26T13:34:50Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/jobs]: Get "https://openqa.suse.de/admin/influxdb/jobs": context deadline exceeded
Before that, a lot of
Updating seen of worker …
And a few
Mar 26 14:34:23 openqa openqa-livehandler-daemon[28668]: [debug] client disconnected: …
…
Mar 26 14:34:23 openqa openqa-livehandler-daemon[28668]: [debug] client connected: …
Could this have to do with long-running live session connections exhausting some worker pool?
In /var/log/error_log there appear to be some timeout messages every couple of minutes, but a surge after the incident started, along with certain errors during the outage. Output of grep 'Mar 26 14:' error_log | grep -v 'AH01110' | sed 's/client [^]]*/:masked:/':
[Tue Mar 26 14:38:39.904559 2024] [proxy_http:error] [pid 31986] (70007)The timeout specified has expired: [:masked:] AH01102: error reading status line from remote server localhost:9526, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:38:39.904677 2024] [proxy:error] [pid 31986] [:masked:] AH00898: Error reading from remote server returned by /tests/13875288/streaming, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:38:39.906407 2024] [negotiation:error] [pid 31986] [:masked:] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:38:42.788536 2024] [proxy_http:error] [pid 20896] (70007)The timeout specified has expired: [:masked:] AH01102: error reading status line from remote server localhost:9526, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:38:42.788653 2024] [proxy:error] [pid 20896] [:masked:] AH00898: Error reading from remote server returned by /tests/13875288/streaming, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:38:42.789293 2024] [negotiation:error] [pid 20896] [:masked:] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var, referer: https://openqa.suse.de/tests/13875288
…
[Tue Mar 26 14:39:15.460642 2024] [negotiation:error] [pid 16619] [:masked:] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:39:16.699839 2024] [proxy_http:error] [pid 16323] (70007)The timeout specified has expired: [:masked:] AH01102: error reading status line from remote server localhost:9526, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:39:16.699908 2024] [proxy:error] [pid 16323] [:masked:] AH00898: Error reading from remote server returned by /tests/13875288/liveterminal, referer: https://openqa.suse.de/tests/13875288
[Tue Mar 26 14:39:16.700229 2024] [negotiation:error] [pid 16323] [:masked:] AH00690: no acceptable variant: /usr/share/apache2/error/HTTP_BAD_GATEWAY.html.var, referer: https://openqa.suse.de/tests/13875288
The above look like symptoms that we can at least use to improve our monitoring. Or should we really just switch to nginx first?
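As a rough illustration of how the symptoms above could feed into monitoring, the following sketch counts AH01102 proxy timeout errors per minute in Apache error_log lines and lists the minutes that exceed a threshold. The function names and the threshold value are hypothetical and not part of any existing openQA or telegraf tooling. The timestamp pattern is an assumption based only on the log lines quoted above.

```python
import re
from collections import Counter

# Matches the timestamp prefix of an Apache error_log line as quoted above, e.g.
# "[Tue Mar 26 14:38:39.904559 2024] [proxy_http:error] ..."
# Group 1 is "Mar 26 14:38" (month, day, hour, minute), group 2 is the year.
TIMESTAMP = re.compile(r"^\[\w{3} (\w{3} +\d+ \d{2}:\d{2}):\d{2}\.\d+ (\d{4})\]")


def proxy_timeouts_per_minute(lines):
    """Count AH01102 (proxy timeout) errors per minute of log time."""
    counts = Counter()
    for line in lines:
        if "AH01102" not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[f"{match.group(2)} {match.group(1)}"] += 1
    return counts


def minutes_over_threshold(counts, threshold=5):
    """Return minutes, in log order, with more timeouts than the threshold.

    The default threshold is an arbitrary placeholder; a real value would
    have to be derived from the normal background rate of timeouts.
    """
    return [minute for minute, n in counts.items() if n > threshold]


if __name__ == "__main__":
    # Path as referenced in this ticket; adjust as needed.
    with open("/var/log/error_log", errors="replace") as log:
        counts = proxy_timeouts_per_minute(log)
    for minute in minutes_over_threshold(counts):
        print(minute, counts[minute])
```

Note that the error_log timestamps are in local time, which on this date is one hour ahead of the UTC times used in the title and the telegraf message.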