action #73174

openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

[osd][alert] Job age (scheduled) (median) alert

Added by okurz 10 months ago. Updated 8 months ago.

Status: Resolved
Priority: Low
Assignee:
Target version:
Start date: 2020-10-09
Due date:
% Done: 0%
Estimated time:

Description

Observation

From:   Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id:    <osd-admins.suse.de>
Date:   09/10/2020 10.06

*/[Alerting] Job age (scheduled) (median) alert/* 

Check for overall decrease of "time to start". Possible reasons for regression:
* Not enough resources
* Too many tests scheduled due to misconfiguration

Related progress issue: https://progress.opensuse.org/issues/65975

*Metric name*: 50% percentile (median)
*Value*: 57675.000

see
http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?fullscreen&edit&tab=alert&panelId=5&orgId=1
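For orientation, a minimal sketch of the kind of check behind this alert, assuming the reported value is in seconds (the mail does not state the unit). The real alert is evaluated in Grafana; the function name, the sample ages and the 12 h threshold below are hypothetical and only chosen so that the median matches the reported 57675.

```python
from statistics import median


def median_age_alert(ages_s, threshold_s):
    """Return (median age, alert firing?) for scheduled-job ages in seconds."""
    med = median(ages_s)
    return med, med > threshold_s


# Hypothetical ages of scheduled jobs in seconds; not real OSD data.
ages = [1200, 30000, 57675, 90000, 100000]
# Hypothetical threshold of 12 h; the configured value is not given in this ticket.
threshold = 12 * 3600

med, firing = median_age_alert(ages, threshold)
print(f"median {med} s = {med / 3600:.1f} h, alert firing: {firing}")
```

Under that assumption the reported median of 57675 corresponds to roughly 16 h of waiting in the scheduled state.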

I already cancelled some misconfigured s390x-kvm jobs and talked to geor about this as he was involved. Many jobs still seem to be pending, in particular ppc64le jobs that are already more than a day old.

History

#1 Updated by okurz 10 months ago

  • Status changed from New to Feedback
  • Assignee set to okurz
  • Priority changed from Urgent to High

In the period https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?fullscreen&edit&tab=alert&panelId=5&orgId=1&from=1602254435495&to=1602265656851 the alert turned green and then went back to pending. The problem was mainly caused by two consecutive SLE15SP3 builds. The down-prioritizing of previous builds that are still scheduled seems to work, as the later build was apparently preferred in multiple scenarios that I looked into. This explains why some jobs are already more than 1 day old while matching workers are still busy with corresponding jobs. I can't identify a single worker class as the offender, though s390x-kvm-sle15 seems to be a candidate that is under considerable pressure.

I paused the alert, hence reducing prio, and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/374 to bump the alerting threshold.

#2 Updated by okurz 10 months ago

  • Priority changed from High to Low

The threshold is higher now but we exceed it again. I have paused the alerts again.

The problem is scenarios like sle-15-SP3-Online-aarch64-prj2_host_upgrade_sles12sp5_to_developing_kvm@virt-arm-64bit-ipmi-machine where each run takes about 5h (!) only to be retriggered by some seemingly inefficient approach. The retrigger links to something about IPMI instabilities reported by the virtualization team and also to GitHub PR https://github.com/SUSE/qa-automation/pull/741, which does not explain a lot. I see big potential for improvement here but I wonder what we can do about this except for excluding anything from the "Virtualization" job groups. So far I have obviously failed to convince people that automated tests should be about 90% green and only fail under very limited circumstances :D

Setting to "Low" as I don't think we will find a good approach to solve this soon. Likely we will come back to this issue as alerts are paused for a long time until something else breaks and we do not realize because the alert was paused, oh well ...

#3 Updated by okurz 9 months ago

okurz wrote:

[…] Likely we will come back to this issue as alerts are paused for a long time until something else breaks and we do not realize because the alert was paused, oh well ...

and this just happened over the past days because of network problems causing very slow network transfers. I raised the topic of long-running virtualization tests with QE PrjM maritawerner and we will bring it up on 2020-10-26 in a meeting with QE Virt.

EDIT: 2020-10-26: Attended meeting with "QE Virtualization" test to clarify some requirements. "Acceptance" should finish within 24h, "Milestone" can take longer. We could not come up with a better suggestion than just excluding Virtualization tests for us.

Other suggestions I have:

  • Use simple scenarios running on our default qemu-based workers to test primary requirements and prevent long-running bare-metal test runs from failing on trivial issues, e.g. package conflicts; schedule the longer-running tests after the simple ones
  • Use nested-virt, not to verify SLE features directly but to check simple requirements first before going into more complicated scenarios
  • Use true text-based consoles to avoid slow VNC connections
  • Use autoyast to speed up installation
  • Split test scenarios to reuse installations, similar to how the kernel team does it

#4 Updated by okurz 8 months ago

  • Estimated time set to 80142.00 h

#5 Updated by okurz 8 months ago

  • Estimated time deleted (80142.00 h)

#7 Updated by okurz 8 months ago

  • Status changed from Feedback to Resolved

merged, haven't triggered yet. Ok as last-resort for now.
