action #73174
closed
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[osd][alert] Job age (scheduled) (median) alert
Added by okurz over 3 years ago.
Updated over 3 years ago.
Description
Observation
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id: <osd-admins.suse.de>
Date: 09/10/2020 10.06
[Alerting] Job age (scheduled) (median) alert
Check for an overall decrease of "time to start". Possible reasons for a regression:
- Not enough resources
- Too many tests scheduled due to misconfiguration

Related progress issue: https://progress.opensuse.org/issues/65975
Metric name: 50% percentile (median)
Value: 57675.000
see
http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?fullscreen&edit&tab=alert&panelId=5&orgId=1
I already cancelled some misconfigured s390x-kvm jobs and talked to @geor about this as he was involved. There still seem to be many jobs pending, in particular ppc64le jobs that are already older than a day.
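To cross-check what the Grafana panel reports, the metric can be recomputed from the openQA REST API. The `/api/v1/jobs` route with a `state` filter exists in openQA; the response key and the host default below are assumptions for illustration, not the monitoring pipeline actually used here:

```python
"""Sketch: recompute the "Job age (scheduled) (median)" value that the
alert fires on, directly from an openQA instance's REST API."""
from datetime import datetime, timezone
from statistics import median


def median_age_seconds(created_timestamps, now=None):
    """Median age in seconds for ISO 8601 "t_created" timestamps (UTC)."""
    now = now or datetime.now(timezone.utc)
    ages = [
        (now - datetime.fromisoformat(t.replace("Z", "+00:00"))).total_seconds()
        for t in created_timestamps
    ]
    return median(ages)


def scheduled_job_ages(host="https://openqa.suse.de"):
    """Fetch creation times of currently scheduled jobs (needs network)."""
    import json
    import urllib.request

    with urllib.request.urlopen(f"{host}/api/v1/jobs?state=scheduled") as resp:
        jobs = json.load(resp).get("jobs", [])  # "jobs" key is an assumption
    return [j["t_created"] for j in jobs]


if __name__ == "__main__":
    # The alert value above was ~57675 s, i.e. a median wait of about 16 h.
    print(median_age_seconds(scheduled_job_ages()))
```

With a fixed reference time, `median_age_seconds` is easy to verify offline before pointing it at a live instance.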
- Status changed from New to Feedback
- Assignee set to okurz
- Priority changed from Urgent to High
- Priority changed from High to Low
The threshold is higher now, but we exceed it again. I have paused the alerts again.
The problem is scenarios like sle-15-SP3-Online-aarch64-prj2_host_upgrade_sles12sp5_to_developing_kvm@virt-arm-64bit-ipmi-machine, where each run takes about 5h (!) just to be retriggered again by a seemingly not very efficient approach. This links to something about IPMI instabilities mentioned by the virtualization team and also to the GitHub PR https://github.com/SUSE/qa-automation/pull/741, which does not explain a lot. I see big potential for improvement here, but I wonder what we can do about it except for excluding anything from the "Virtualization" job groups. So far I obviously failed to convince people that automated tests should be about 90% green and only fail under very limited circumstances :D
Setting to "Low" as I don't think we will find a good approach to solve this soon. Likely we will come back to this issue because the alerts stay paused for a long time until something else breaks and we do not realize it because the alert was paused, oh well ...
okurz wrote:
[…] Likely we will come back to this issue as alerts are paused for a long time until something else breaks and we do not realize because the alert was paused, oh well ...
and this just happened in the past days because of network problems causing very slow network transfers. I addressed the topic of long-running virtualization tests with QE PrjM maritawerner and we will bring it up 2020-10-26 in a meeting with QE Virt.
EDIT: 2020-10-26: Attended a meeting with the "QE Virtualization" team to clarify some requirements: "Acceptance" tests should finish within 24h, "Milestone" tests can take longer. We could not come up with a better suggestion than just excluding the Virtualization tests for us.
Other suggestions I have:
- Use simple scenarios running on our default qemu-based workers to test primary requirements and prevent long-running bare-metal test runs from failing on trivial issues, e.g. package conflicts; schedule the longer-running tests after the simple ones
- Use nested-virt, not to verify SLE features directly but to check simple requirements first before going into more complicated scenarios
- Use true text-based consoles to avoid slow VNC connections
- Use autoyast to speed up the installation
- Split test scenarios to reuse installations, similar to how the kernel team does it
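The split-and-reuse suggestion maps onto openQA's existing job-dependency settings (START_AFTER_TEST, PUBLISH_HDD_1, HDD_1, BOOT_HDD_IMAGE are real openQA variables); a minimal sketch, with suite and image names being hypothetical:

```
# Test suite "install_minimal": runs only the (autoyast) installation
# once and publishes the resulting disk image for reuse.
AUTOYAST=autoyast_minimal.xml          # hypothetical profile name
PUBLISH_HDD_1=sle-minimal-installed.qcow2

# Test suite "virt_feature_x": skips the installation entirely by
# booting the published image; scheduled only after the install passed.
START_AFTER_TEST=install_minimal
HDD_1=sle-minimal-installed.qcow2
BOOT_HDD_IMAGE=1
```

This way a trivial failure like a package conflict kills one short qemu job instead of a 5h bare-metal run.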
- Estimated time set to 80142.00 h
- Estimated time deleted (80142.00 h)
- Parent task set to #80142
- Status changed from Feedback to Resolved
Merged, haven't triggered it yet. Ok as a last resort for now.