action #157081


openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z

Added by okurz about 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-03-12
Due date:
% Done:

0%

Estimated time:

Description

Observation

As seen in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1710230445416&to=1710235845416

Screenshot_20240312_114038_osd_slow_2024-03-12-0830Z.png

Also reported in https://suse.slack.com/archives/C02CANHLANP/p1710232779791019 #eng-testing by Jose Gomez and in https://suse.slack.com/archives/C02CSAZLAR4/p1710233069225929 #team-lsg-qe-core by Richard Fan.

For the time period 083500Z-085100Z the HTTP response time of OSD is significantly above the expected range. During that timeframe CPU usage and CPU load are very low, and there is no significant change in memory, minion workers or most other graphs.
Storage I/O requests, bytes and times show a significant reduction around 083630Z.

Network usage shows a decrease from 2-8GB/s "out" and 50-800Mb/s "in" down to near-zero, especially for "out", starting 083630Z.

In the hour before the incident, 073500Z-083500Z, I see an increase in "Requests per second" on the apache server which peaks exactly at 083500Z before going down again after the unresponsiveness started.

Screenshot_20240312_115450_osd_slow_2024-03-12-0830Z_http_response_vs_apache_requests.png
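
As a cross-check of the Grafana "Requests per second" panel, a minimal shell sketch (assuming shell access to OSD and the default log location /var/log/apache2/access_log; log timestamps are local time, +0100) to bucket the access_log requests per minute:

# count requests per minute (day/month/year:hour:minute) from the apache access log
awk -F'[][]' '{split($2, t, ":"); print t[1] ":" t[2] ":" t[3]}' /var/log/apache2/access_log | sort | uniq -c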


Related issues 2 (1 open, 1 closed)

Related to openQA Project - action #130636: high response times on osd - Try nginx on OSD size:S (Workable)

Copied to openQA Infrastructure - action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 (Resolved, okurz, 2024-03-12)

Actions #1

Updated by okurz about 2 months ago

https://jira.suse.com/browse/ENGINFRA-1893 would help to find out if there is any related network problem.

Another unresponsiveness window happened in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1710241294233&to=1710244273544

Actions #2

Updated by okurz about 2 months ago · Edited

Nothing suspicious in the system journal, nothing suspicious in /var/log/openqa_scheduler. In /var/log/apache2/access_log there are a lot of requests from qem-bot at the same time, like

10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install-TERADATA&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 1257 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62547
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=ppc64le HTTP/1.1" 200 972 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62554
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=s390x HTTP/1.1" 200 1025 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62560
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP5&arch=aarch64 HTTP/1.1" 200 1117 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62573
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=aarch64 HTTP/1.1" 200 1059 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62571
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-TERADATA&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 4019 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62619
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Desktop-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 952 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62601
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Desktop-DVD-Incidents&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP5&arch=x86_64 HTTP/1.1" 200 2815 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62622

which could be linked to https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2376442 "sync incidents" getting openQA test results for many incidents.
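
To quantify how much of that load came from qem-bot, a rough shell sketch over the access log; the user agent string is taken from the log lines above and the trailing field per line is assumed to be the request duration recorded by our apache LogFormat:

# qem-bot API requests per minute
grep 'python-OpenQA_Client/qem-bot' /var/log/apache2/access_log | awk -F'[][]' '{print $2}' | cut -d: -f1-3 | sort | uniq -c
# slowest qem-bot requests, assuming the last field is the duration
grep 'python-OpenQA_Client/qem-bot' /var/log/apache2/access_log | awk '{print $NF, $7}' | sort -rn | head -20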

wget -O - https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2376442/raw | grep -c 'Getting openQA tests results ' counts 2561 such requests, none of which look exactly lightweight. We could try to artificially run those requests in an even heavier way, trying to break openQA.
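
A minimal sketch of such an artificial reproduction, firing qem-bot-style job queries in parallel with openqa-cli; the concrete flavor/build/version values are only copied from the access_log entries above as placeholders, and this should be pointed at a test instance before touching OSD:

# replay 200 parallel qem-bot-style job queries (parameter values are placeholders from the log above)
for i in $(seq 200); do
  openqa-cli api --host https://openqa.suse.de jobs scope=relevant latest=1 distri=sle version=15-SP4 arch=x86_64 flavor=Server-DVD-Incidents-Install build=:32819:libsemanage &
done
wait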

At times there are also a lot of asset requests and zypper curl calls, like

2a07:de40:b203:12:b416:dcff:fe07:4e5d - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/xz-5.4.6-150600.1.15.ppc64le.rpm HTTP/1.1" 200 316348 "-" "ZYpp 17.31.15 (curl 8.0.1)" 1
2a07:de40:b203:12:5028:3eff:fe2d:1eb1 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/ruby-common-2.1-3.15.noarch.rpm HTTP/1.1" 200 36752 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:423:f5ff:fe3c:2c73 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/python3-dbm-3.6.15-150300.10.54.1.ppc64le.rpm HTTP/1.1" 200 68456 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:423:f5ff:fe3c:2c73 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/scout-0.2.7+20230124.b4e3468-150500.1.1.noarch.rpm HTTP/1.1" 200 88548 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:b416:dcff:fe07:4e5d - - [12/Mar/2024:12:21:47 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/xkeyboard-config-2.40-150600.1.1.noarch.rpm HTTP/1.1" 200 440824 "-" "ZYpp 17.31.15 (curl 8.0.1)" 1
2a07:de40:b203:12:5028:3eff:fe2d:1eb1 - - [12/Mar/2024:12:21:47 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/logrotate-3.18.1-150400.3.7.1.ppc64le.rpm HTTP/1.1" 200 85192 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0

but I suspect those are not a problem as they are small, direct file requests, unlike the qem-bot requests for job details which need corresponding database queries.
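
To back that suspicion with numbers, a quick sketch comparing the logged per-request times of asset downloads against the job API queries (again assuming the trailing access_log field is the request duration):

# average trailing duration field for asset requests vs. job API requests
awk '$7 ~ "^/assets/" {a+=$NF; na++} $7 ~ "^/api/v1/jobs" {j+=$NF; nj++} END {if (na) print "assets:", a/na, "avg over", na; if (nj) print "api/v1/jobs:", j/nj, "avg over", nj}' /var/log/apache2/access_log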

Actions #3

Updated by openqa_review about 2 months ago

  • Due date set to 2024-03-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by okurz about 2 months ago

  • Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Actions #5

Updated by okurz about 2 months ago

  • Status changed from In Progress to Feedback

So for now I would not plan any further actions. We already have #130636 as the next step that could be done when it fits our planning. Any other ideas?

Actions #6

Updated by okurz about 2 months ago

  • Due date deleted (2024-03-27)
  • Status changed from Feedback to Resolved

No more problems observed and we agree that this is very much the same problem that should be fixed with #130636.

Actions #7

Updated by okurz about 1 month ago

  • Copied to action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 added