action #73174
closed · openQA Project (public) · coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[osd][alert] Job age (scheduled) (median) alert
Description
Observation
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id: <osd-admins.suse.de>
Date: 09/10/2020 10.06
*[Alerting] Job age (scheduled) (median) alert*
Check for overall decrease of "time to start". Possible reasons for regression:
- Not enough resources
- Too many tests scheduled due to misconfiguration
Related progress issue: https://progress.opensuse.org/issues/65975
*Metric name*: 50% percentile (median)
*Value*: 57675.000
I already cancelled some misconfigured s390x-kvm jobs and talked to @geor about this as he was involved. There still seem to be many jobs pending, in particular ppc64le jobs that are already older than a day.
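For reference, a minimal sketch of how the alerted value can be reproduced against the openQA API; it assumes the /api/v1/jobs route with state=scheduled and its t_created timestamp field, and that the panel reports the median age in seconds (57675 s is roughly 16 h):

```python
# Sketch only: compute the median age of scheduled jobs, roughly matching the
# "Job age (scheduled) (median)" panel. The host and the unit (seconds) are
# assumptions, not taken from the dashboard definition.
from datetime import datetime, timezone
from statistics import median
import requests

OPENQA = "https://openqa.suse.de"  # assumed instance

jobs = requests.get(f"{OPENQA}/api/v1/jobs", params={"state": "scheduled"}).json()["jobs"]
now = datetime.now(timezone.utc)
ages = [
    (now - datetime.strptime(j["t_created"], "%Y-%m-%dT%H:%M:%S")
        .replace(tzinfo=timezone.utc)).total_seconds()
    for j in jobs
]
if ages:
    print(f"scheduled jobs: {len(ages)}, median age: {median(ages):.0f} s")
else:
    print("no scheduled jobs")
```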
Updated by okurz about 4 years ago
- Status changed from New to Feedback
- Assignee set to okurz
- Priority changed from Urgent to High
In the period https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?fullscreen&edit&tab=alert&panelId=5&orgId=1&from=1602254435495&to=1602265656851 the alert turned green and back to pending in the meantime. The problem was mainly caused by two consecutive SLE15SP3 builds. Down-prioritizing previous builds while they are still scheduled seems to work, as the later build was apparently preferred in multiple scenarios that I looked into. This explains why we have some jobs that are already more than a day old even though matching workers are still busy with corresponding jobs. I cannot identify a single worker class as the offender, though s390x-kvm-sle15 seems to be a candidate that is quite under pressure.
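To illustrate the down-prioritization observed here, a minimal sketch under the assumption that lower priority values are scheduled first and that jobs from superseded builds of the same group get their priority value bumped; this is only the idea, not openQA's actual scheduler code, and the penalty value and field names are made up:

```python
# Sketch of the idea only: jobs from superseded builds of the same job group
# get a higher priority *value*, i.e. are scheduled after the latest build.
from collections import defaultdict

PENALTY = 10  # assumed offset added to jobs of older builds

def deprioritize(scheduled_jobs):
    """scheduled_jobs: list of dicts with 'group', 'build' and 'prio' keys."""
    latest = defaultdict(str)
    for job in scheduled_jobs:
        latest[job["group"]] = max(latest[job["group"]], job["build"])
    for job in scheduled_jobs:
        if job["build"] < latest[job["group"]]:
            job["prio"] += PENALTY  # older build -> scheduled later
    return scheduled_jobs

jobs = [
    {"group": "SLE 15 SP3", "build": "20.1", "prio": 50},
    {"group": "SLE 15 SP3", "build": "21.1", "prio": 50},
]
print(deprioritize(jobs))  # the 20.1 job ends up with prio 60
```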
I paused the alert, hence the reduced priority of this ticket, and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/374 to bump the alerting threshold.
Updated by okurz about 4 years ago
- Priority changed from High to Low
The threshold is higher now, but we exceed it again. I have paused the alerts again.
The problem is scenarios like sle-15-SP3-Online-aarch64-prj2_host_upgrade_sles12sp5_to_developing_kvm@virt-arm-64bit-ipmi-machine, where each run takes about 5h (!) just to be retriggered again by a seemingly inefficient approach, with references to IPMI instabilities reported by the virtualization team and to the GitHub PR https://github.com/SUSE/qa-automation/pull/741, which does not explain much. I see big potential for improvement here, but I wonder what we can do about it other than excluding everything from the "Virtualization" job groups. So far I have obviously failed to convince people that automated tests should be about 90% green and only fail under very limited circumstances :D
Setting the priority to "Low" as I don't think we will find a good approach to solve this soon. Likely we will come back to this issue when the alert has been paused for a long time, something else breaks and we do not notice because the alert was paused, oh well ...
Updated by okurz about 4 years ago
okurz wrote:
[…] Likely we will come back to this issue when the alert has been paused for a long time, something else breaks and we do not notice because the alert was paused, oh well ...
And this just happened over the past days because of network problems causing very slow network transfer. I raised the topic of long-running virtualization tests with QE PrjM maritawerner and we will bring it up on 2020-10-26 in a meeting with QE Virt.
EDIT: 2020-10-26: Attended the meeting with the "QE Virtualization" team to clarify some requirements. "Acceptance" tests should finish within 24h, "Milestone" tests can take longer. We could not come up with a better suggestion than just excluding the Virtualization tests on our side.
Other suggestions I have:
- Use simple scenarios running on our default qemu based workers to test primary requirements and prevent long-running bare-metal test runs from failing on trivial issues, e.g. package conflicts; schedule other, longer-running tests after the simple ones (see the sketch after this list)
- Use nested-virt, not to verify SLE features directly but to check simple requirements first before going into more complicated scenarios
- Use true text-based consoles to avoid slow VNC connections
- Use autoyast to speed up installation
- Split test scenarios to reuse installations, similar to how the kernel team does it
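As a sketch of the first suggestion, the chaining could be expressed with openQA's existing START_AFTER_TEST chained-dependency setting; the test suite names and worker classes below are made up for illustration, and the dicts only show which settings the two suites would carry:

```python
# Hypothetical test suite settings illustrating "simple qemu scenario first,
# long bare-metal run only afterwards". START_AFTER_TEST is openQA's setting
# for chained job dependencies; everything else here is a made-up example,
# not the real schedule.
quick_smoke = {
    "TEST": "virt_smoke_qemu",          # fast sanity run on a default qemu worker
    "WORKER_CLASS": "qemu_aarch64",
}

bare_metal_upgrade = {
    "TEST": "prj2_host_upgrade_kvm",    # the ~5h bare-metal scenario
    "WORKER_CLASS": "virt-arm-64bit-ipmi",
    "START_AFTER_TEST": "virt_smoke_qemu",  # only runs after the smoke test passed
}
```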
Updated by okurz about 4 years ago
- Parent task set to #80142
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
Merged, haven't triggered it yet. Ok as a last resort for now.