action #157081 (closed)
openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project - coordination #108209: [epic] Reduce load on OSD
OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z
Description
Observation
As seen in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1710230445416&to=1710235845416
Also reported in https://suse.slack.com/archives/C02CANHLANP/p1710232779791019 #eng-testing by Jose Gomez and in https://suse.slack.com/archives/C02CSAZLAR4/p1710233069225929 #team-lsg-qe-core by Richard Fan.
For the time period 083500Z-085100Z the HTTP response times of OSD are significantly above the expected range. During that timeframe CPU usage and CPU load are very low, and there is no significant change in memory, minion workers and most other graphs.
Storage I/O requests, bytes and times show a significant reduction around 083630Z.
Network usage shows a decrease from 2-8GB/s "out" and 50-800Mb/s "in" down to near zero, especially for "out", starting at 083630Z.
In the hour before the incident, 073500Z-083500Z, I see an increase in the "Requests per second" on the apache server, which peaks exactly at 083500Z before going down again after the unresponsiveness started.
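To quantify that apache request peak outside of Grafana, the requests per second can be counted directly from the access log. A minimal sketch, assuming the default combined log format with the timestamp in the fourth field and chronologically ordered lines:
awk '{print substr($4, 2, 20)}' /var/log/apache2/access_log | uniq -c | sort -rn | head
This prints the busiest seconds together with their request counts.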
Updated by okurz 8 months ago
https://jira.suse.com/browse/ENGINFRA-1893 would help to find out if there is any related network problem.
Another unresponsiveness window happened in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1710241294233&to=1710244273544
Updated by okurz 8 months ago · Edited
Nothing suspicious in the system journal, nothing suspicious in /var/log/openqa_scheduler. In /var/log/apache2/access_log there are a lot of requests from qem-bot at the same time, like
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install-TERADATA&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 1257 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62547
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=ppc64le HTTP/1.1" 200 972 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62554
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=s390x HTTP/1.1" 200 1025 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62560
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP5&arch=aarch64 HTTP/1.1" 200 1117 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62573
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=aarch64 HTTP/1.1" 200 1059 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62571
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-TERADATA&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 4019 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62619
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Desktop-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 952 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62601
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Desktop-DVD-Incidents&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP5&arch=x86_64 HTTP/1.1" 200 2815 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62622
which could be linked to https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2376442 "sync incidents" getting openQA test results for many incidents.
wget -O - https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2376442/raw | grep -c 'Getting openQA tests results '
counts 2561 such requests, none of which seem exactly light-weight. We could try to artificially run those requests in an even heavier way, trying to break openQA.
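A minimal sketch of how such a load test could look, with the query parameters copied from the access_log above and an arbitrarily chosen concurrency of 50 parallel requests (not what qem-bot actually does):
# fire 50 concurrent "scope=relevant" job queries against OSD and print the total time per request
for i in $(seq 50); do
  curl -s -o /dev/null -w '%{time_total}\n' \
    'https://openqa.suse.de/api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64' &
done; wait
If response times climb noticeably while this runs, it would support the theory that a burst of such queries can make OSD unresponsive.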
At times there are also a lot of asset requests and zypper curl calls, like
2a07:de40:b203:12:b416:dcff:fe07:4e5d - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/xz-5.4.6-150600.1.15.ppc64le.rpm HTTP/1.1" 200 316348 "-" "ZYpp 17.31.15 (curl 8.0.1)" 1
2a07:de40:b203:12:5028:3eff:fe2d:1eb1 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/ruby-common-2.1-3.15.noarch.rpm HTTP/1.1" 200 36752 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:423:f5ff:fe3c:2c73 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/python3-dbm-3.6.15-150300.10.54.1.ppc64le.rpm HTTP/1.1" 200 68456 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:423:f5ff:fe3c:2c73 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/scout-0.2.7+20230124.b4e3468-150500.1.1.noarch.rpm HTTP/1.1" 200 88548 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:b416:dcff:fe07:4e5d - - [12/Mar/2024:12:21:47 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/xkeyboard-config-2.40-150600.1.1.noarch.rpm HTTP/1.1" 200 440824 "-" "ZYpp 17.31.15 (curl 8.0.1)" 1
2a07:de40:b203:12:5028:3eff:fe2d:1eb1 - - [12/Mar/2024:12:21:47 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/logrotate-3.18.1-150400.3.7.1.ppc64le.rpm HTTP/1.1" 200 85192 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
but I suspect that those are not a problem, as they should be small, direct file requests, unlike qem-bot requesting job details, which needs corresponding database queries.
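That suspicion could be checked against the access_log itself. A rough sketch, assuming the trailing number on each line is the per-request response time as configured in the local apache LogFormat:
# compare the average of the last field between API job queries and asset downloads
awk '/GET \/api\/v1\/jobs/ {api+=$NF; na++} /GET \/assets\/repo/ {ast+=$NF; ns++} END {print "api/v1/jobs avg:", (na ? api/na : 0), "assets/repo avg:", (ns ? ast/ns : 0)}' /var/log/apache2/access_log
If the job API queries dominate by orders of magnitude, the zypper asset traffic can likely be ruled out.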
Updated by openqa_review 8 months ago
- Due date set to 2024-03-27
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 8 months ago
- Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Updated by okurz 7 months ago
- Copied to action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 added