action #157081


openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #108209: [epic] Reduce load on OSD

OSD unresponsive or significantly slow for some minutes 2024-03-12 08:30Z

Added by okurz about 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-03-12
Due date:
% Done:

0%

Estimated time:

Description

Observation

As seen in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1710230445416&to=1710235845416

Screenshot_20240312_114038_osd_slow_2024-03-12-0830Z.png

Also reported in https://suse.slack.com/archives/C02CANHLANP/p1710232779791019 #eng-testing by Jose Gomez and in https://suse.slack.com/archives/C02CSAZLAR4/p1710233069225929 #team-lsg-qe-core by Richard Fan.

For the time period 083500Z-085100Z the HTTP response time of OSD is significantly above the expected range. During that timeframe CPU usage and CPU load are very low, and there is no significant change in memory, minion workers or most other graphs.
Storage I/O requests, bytes and times show a significant reduction around 083630Z.

Network usage shows a decrease from 2-8GB/s "out" and 50-800Mb/s "in" down to near-zero, especially for "out", starting 083630Z.

In the hour before the incident, 073500Z-083500Z, I see an increase in "Requests per second" on the apache server which peaks exactly at 083500Z before going down again after the unresponsiveness started.

Screenshot_20240312_115450_osd_slow_2024-03-12-0830Z_http_response_vs_apache_requests.png
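
As a cross-check of the Grafana "Requests per second" panel, a minimal shell sketch (assuming shell access to OSD and the default log location /var/log/apache2/access_log; log timestamps are local time, +0100) to bucket the access_log requests per minute:

# count requests per minute (day/month/year:hour:minute) from the apache access log
awk -F'[][]' '{split($2, t, ":"); print t[1] ":" t[2] ":" t[3]}' /var/log/apache2/access_log | sort | uniq -c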


Related issues 2 (1 open, 1 closed)

Related to openQA Project - action #130636: high response times on osd - Try nginx on OSD size:S (Workable)

Copied to openQA Infrastructure - action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 (Resolved, okurz, 2024-03-12)

Actions #1

Updated by okurz about 2 months ago

https://jira.suse.com/browse/ENGINFRA-1893 would help to find out if there is any related network problem.

Another unresponsiveness window happened in https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1710241294233&to=1710244273544

Actions #2

Updated by okurz about 2 months ago · Edited

Nothing suspicious in the system journal, nothing suspicious in /var/log/openqa_scheduler. In /var/log/apache2/access_log there are a lot of requests from qem-bot at the same time, like

10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install-TERADATA&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 1257 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62547
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=ppc64le HTTP/1.1" 200 972 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62554
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=s390x HTTP/1.1" 200 1025 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62560
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP5&arch=aarch64 HTTP/1.1" 200 1117 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62573
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=aarch64 HTTP/1.1" 200 1059 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62571
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Server-DVD-Incidents-TERADATA&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 4019 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62619
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Desktop-DVD-Incidents-Install&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP4&arch=x86_64 HTTP/1.1" 200 952 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62601
10.144.53.193 - - [12/Mar/2024:12:18:26 +0100] "GET /api/v1/jobs?scope=relevant&latest=1&flavor=Desktop-DVD-Incidents&distri=sle&build=%3A32819%3Alibsemanage&version=15-SP5&arch=x86_64 HTTP/1.1" 200 2815 "-" "python-OpenQA_Client/qem-bot/1.0.0" 62622

which could be linked to https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2376442 "sync incidents" getting openQA test results for many incidents.
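
To quantify how much of that load came from qem-bot, a rough shell sketch over the access log; the user agent string is taken from the log lines above and the trailing field per line is assumed to be the request duration recorded by our apache LogFormat:

# qem-bot API requests per minute
grep 'python-OpenQA_Client/qem-bot' /var/log/apache2/access_log | awk -F'[][]' '{print $2}' | cut -d: -f1-3 | sort | uniq -c
# slowest qem-bot requests, assuming the last field is the duration
grep 'python-OpenQA_Client/qem-bot' /var/log/apache2/access_log | awk '{print $NF, $7}' | sort -rn | head -20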

wget -O - https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2376442/raw | grep -c 'Getting openQA tests results ' counts 2561 such requests, none of which look exactly lightweight. We could try to artificially run those requests in an even heavier way, trying to break openQA.
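
A minimal sketch of such an artificial reproduction, firing qem-bot-style job queries in parallel with openqa-cli; the concrete flavor/build/version values are only copied from the access_log entries above as placeholders, and this should be pointed at a test instance before touching OSD:

# replay 200 parallel qem-bot-style job queries (parameter values are placeholders from the log above)
for i in $(seq 200); do
  openqa-cli api --host https://openqa.suse.de jobs scope=relevant latest=1 distri=sle version=15-SP4 arch=x86_64 flavor=Server-DVD-Incidents-Install build=:32819:libsemanage &
done
wait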

At times there are also a lot of asset requests and zypper curl calls, like

2a07:de40:b203:12:b416:dcff:fe07:4e5d - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/xz-5.4.6-150600.1.15.ppc64le.rpm HTTP/1.1" 200 316348 "-" "ZYpp 17.31.15 (curl 8.0.1)" 1
2a07:de40:b203:12:5028:3eff:fe2d:1eb1 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/ruby-common-2.1-3.15.noarch.rpm HTTP/1.1" 200 36752 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:423:f5ff:fe3c:2c73 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/python3-dbm-3.6.15-150300.10.54.1.ppc64le.rpm HTTP/1.1" 200 68456 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:423:f5ff:fe3c:2c73 - - [12/Mar/2024:12:21:46 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/scout-0.2.7+20230124.b4e3468-150500.1.1.noarch.rpm HTTP/1.1" 200 88548 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0
2a07:de40:b203:12:b416:dcff:fe07:4e5d - - [12/Mar/2024:12:21:47 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/noarch/xkeyboard-config-2.40-150600.1.1.noarch.rpm HTTP/1.1" 200 440824 "-" "ZYpp 17.31.15 (curl 8.0.1)" 1
2a07:de40:b203:12:5028:3eff:fe2d:1eb1 - - [12/Mar/2024:12:21:47 +0100] "GET /assets/repo/SLE-15-SP6-Module-Basesystem-POOL-ppc64le-Build62.1-Media1/ppc64le/logrotate-3.18.1-150400.3.7.1.ppc64le.rpm HTTP/1.1" 200 85192 "-" "ZYpp 17.31.15 (curl 8.0.1)" 0

but I suspect those are not a problem as they are small, direct file requests, unlike the qem-bot requests for job details which need corresponding database queries.
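
To back that suspicion with numbers, a quick sketch comparing the logged per-request times of asset downloads against the job API queries (again assuming the trailing access_log field is the request duration):

# average trailing duration field for asset requests vs. job API requests
awk '$7 ~ "^/assets/" {a+=$NF; na++} $7 ~ "^/api/v1/jobs" {j+=$NF; nj++} END {if (na) print "assets:", a/na, "avg over", na; if (nj) print "api/v1/jobs:", j/nj, "avg over", nj}' /var/log/apache2/access_log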

Actions #3

Updated by openqa_review about 2 months ago

  • Due date set to 2024-03-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by okurz about 2 months ago

  • Related to action #130636: high response times on osd - Try nginx on OSD size:S added
Actions #5

Updated by okurz about 2 months ago

  • Status changed from In Progress to Feedback

So for now I would not plan any further actions. We already have #130636 as the next step that could be done when it fits our planning. Any other ideas?

Actions #6

Updated by okurz about 2 months ago

  • Due date deleted (2024-03-27)
  • Status changed from Feedback to Resolved

No more problems observed and we agree that this is very much the same problem that should be fixed with #130636.

Actions #7

Updated by okurz about 1 month ago

  • Copied to action #157666: OSD unresponsive and then not starting any more jobs on 2024-03-21 added