action #73174: [osd][alert] Job age (scheduled) (median) alert - openQA Infrastructure (public) - openSUSE Project Management Tool

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #178243: [epic] More efficient handling of big job schedules, not executable jobs, never matching worker classes, etc.

[osd][alert] Job age (scheduled) (median) alert

Added by okurz over 4 years ago. Updated 3 months ago.

Status:

Resolved

Priority:

Low

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2020-10-09

Due date:

% Done:

Estimated time:

Tags:

alert, osd, job age

Description

Observation¶

From:	Grafana <osd-admins@suse.de>
To:	osd-admins@suse.de
Sender:	osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id:	<osd-admins.suse.de>
Date:	09/10/2020 10.06

*/[Alerting] Job age (scheduled) (median) alert/* 

Check for overall decrease of "time to start". Possible reasons for regression:
* Not enough resources
* Too many tests scheduled due to misconfiguration

Related progress issue: https://progress.opensuse.org/issues/65975

*Metric name*: 50% percentile (median)
*Value*: 57675.000

see
http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?fullscreen&edit&tab=alert&panelId=5&orgId=1
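
For illustration, a minimal sketch of what a median-based check like this alert evaluates. This is not the actual Grafana rule. The unit (seconds), the sample ages, and the threshold value are assumptions made up for the example; only the 57675 reading is from the mail above.

```python
from statistics import median


def median_job_age(ages_s):
    """Return the median age of scheduled jobs, or 0.0 if nothing is scheduled."""
    return float(median(ages_s)) if ages_s else 0.0


def is_alerting(ages_s, threshold_s):
    """True if the median age of scheduled jobs exceeds the threshold."""
    return median_job_age(ages_s) > threshold_s


# Hypothetical ages of five scheduled jobs, assumed to be in seconds.
sample_ages = [1200, 57675, 90000, 3600, 172800]
# Hypothetical threshold (12 h); the real alert threshold is not stated here.
threshold = 12 * 3600

print(median_job_age(sample_ages), is_alerting(sample_ages, threshold))
```

Because the median is used, a few very old jobs alone do not trigger the check; at least half of the scheduled jobs have to be older than the threshold.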

I already cancelled some misconfigured s390x-kvm jobs and talked to @geor about this as he was involved. There still seem to be many jobs pending, in particular ppc64le jobs that are already older than a day.

Related issues 1 (1 open — 0 closed)

Updated by okurz over 4 years ago

Status changed from New to Feedback
Assignee set to okurz
Priority changed from Urgent to High

In the period https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?fullscreen&edit&tab=alert&panelId=5&orgId=1&from=1602254435495&to=1602265656851 the alert turned green and then went back to pending. The problem was mainly caused by two consecutive SLE15SP3 builds. It seems the deprioritization of jobs from previous builds that are still scheduled works, as the later build was apparently preferred in multiple scenarios that I looked into. This explains why some jobs are already more than 1 day old even though matching workers are still busy with corresponding jobs. I can't identify a single worker class as the offender, though s390x-kvm-sle15 seems to be a candidate that is under considerable pressure.
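
To illustrate why deprioritizing older builds leaves their jobs waiting, here is a small sketch. It is not the openQA scheduler code. The `Job` type, the penalty value, the build names, and the rule "lower priority value is picked first" are assumptions for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Job:
    id: int
    build: str
    prio: int  # assumption: the job with the lowest value is picked first


def deprioritize_old_builds(jobs, latest_build, penalty=10):
    """Raise the priority value of scheduled jobs that are not in the latest build."""
    return [
        j if j.build == latest_build else replace(j, prio=j.prio + penalty)
        for j in jobs
    ]


def pick_next(jobs):
    """Pick the job with the lowest priority value, oldest job id first on ties."""
    return min(jobs, key=lambda j: (j.prio, j.id))


# Two scheduled jobs of the same scenario from consecutive, hypothetical builds.
scheduled = [Job(1, "100.1", 50), Job(2, "101.1", 50)]
adjusted = deprioritize_old_builds(scheduled, latest_build="101.1")
print(pick_next(adjusted))
```

With such a rule the job from the newer build is picked first, so the job from the older build keeps aging in the queue while the workers are busy.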

I paused the alert, hence reducing prio, and created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/374 to bump the alerting threshold.

Updated by okurz over 4 years ago

Priority changed from High to Low

The threshold is higher now but we exceed it again. I have paused the alerts again.

The problem is scenarios like sle-15-SP3-Online-aarch64-prj2_host_upgrade_sles12sp5_to_developing_kvm@virt-arm-64bit-ipmi-machine, where each run takes about 5h (!) only to be retriggered by a seemingly inefficient approach that links to something about IPMI instabilities reported by the virtualization team and to the GitHub PR https://github.com/SUSE/qa-automation/pull/741, which does not explain a lot. I see big potential for improvement here, but I wonder what we can do about it except for excluding everything from the "Virtualization" job groups. So far I have obviously failed to convince people that automated tests should be about 90% green and only fail under very limited circumstances :D

Setting to "Low" as I don't think we will find a good approach to solve this soon. Likely we will come back to this issue as alerts are paused for a long time until something else breaks and we do not realize because the alert was paused, oh well ...

Updated by okurz over 4 years ago

okurz wrote:

[…] Likely we will come back to this issue as alerts are paused for a long time until something else breaks and we do not realize because the alert was paused, oh well ...

And this just happened in the past days because of network problems causing very slow network transfers. I addressed the topic of long-running virtualization tests with QE PrjM maritawerner and we will bring it up on 2020-10-26 in a meeting with QE Virt.

EDIT: 2020-10-26: Attended a meeting with "QE Virtualization" to clarify some requirements. "Acceptance" should finish within 24h, "Milestone" can take longer. We could not come up with a better suggestion for us than just excluding the Virtualization tests.

Other suggestions I have:

* Use simple scenarios running on our default qemu-based workers to test primary requirements and prevent long-running bare-metal test runs from failing on trivial issues, e.g. package conflicts; schedule the other, longer-running tests after the simple ones
* Use nested-virt, not to verify SLE features directly but to check simple requirements first before going into more complicated scenarios
* Use true text-based consoles to avoid slow VNC connections
* Use autoyast to speed up installation
* Split test scenarios to reuse installations, similar to how the kernel team does that
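
The first suggestion, running cheap checks before long bare-metal runs, could be sketched roughly like this. This is only an illustration and not an openQA feature or API. The `Scenario` type, the scenario names, and the durations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    est_hours: float
    bare_metal: bool


def staged_schedule(scenarios):
    """Split scenarios into a cheap first stage and an expensive second stage."""
    first = sorted((s for s in scenarios if not s.bare_metal), key=lambda s: s.est_hours)
    second = sorted((s for s in scenarios if s.bare_metal), key=lambda s: s.est_hours)
    return first, second


def next_jobs(scenarios, results):
    """Return scenario names to schedule next.

    results maps names of finished scenarios to "passed" or "failed".
    Bare-metal scenarios are only scheduled once all cheap ones have passed.
    """
    first, second = staged_schedule(scenarios)
    pending_first = [s.name for s in first if s.name not in results]
    if pending_first:
        return pending_first
    if all(results[s.name] == "passed" for s in first):
        return [s.name for s in second if s.name not in results]
    return []  # a cheap check failed, do not spend hours on bare metal


scenarios = [
    Scenario("qemu_smoke", 0.2, bare_metal=False),        # hypothetical name
    Scenario("ipmi_host_upgrade", 5.0, bare_metal=True),  # hypothetical name
]
print(next_jobs(scenarios, {}))
print(next_jobs(scenarios, {"qemu_smoke": "passed"}))
print(next_jobs(scenarios, {"qemu_smoke": "failed"}))
```

A trivial failure such as a package conflict would then show up in the short qemu run and the 5h bare-metal run would not be started at all.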