action #73174
closed
openQA Project - coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
[osd][alert] Job age (scheduled) (median) alert
Added by okurz over 3 years ago.
Updated over 3 years ago.
Description
Observation
From: Grafana <osd-admins@suse.de>
To: osd-admins@suse.de
Sender: osd-admins <osd-admins-bounces+okurz=suse.de@suse.de>
List-Id: <osd-admins.suse.de>
Date: 09/10/2020 10.06
[Alerting] Job age (scheduled) (median) alert
Check for an overall decrease of "time to start". Possible reasons for a regression:
- Not enough resources
- Too many tests scheduled due to misconfiguration

Related progress issue: https://progress.opensuse.org/issues/65975
Metric name: 50% percentile (median)
Value: 57675.000
see
http://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?fullscreen&edit&tab=alert&panelId=5&orgId=1
I already cancelled some misconfigured s390x-kvm jobs and talked to @geor about this as he was involved. There still seem to be many jobs pending, in particular ppc64le jobs that are already older than a day.
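To cross-check what the Grafana panel reports, the metric can be recomputed from the openQA REST API. The `/api/v1/jobs` route with a `state` filter exists in openQA; the response key and the host default below are assumptions for illustration, not the monitoring pipeline actually used here:

```python
"""Sketch: recompute the "Job age (scheduled) (median)" value that the
alert fires on, directly from an openQA instance's REST API."""
from datetime import datetime, timezone
from statistics import median


def median_age_seconds(created_timestamps, now=None):
    """Median age in seconds for ISO 8601 "t_created" timestamps (UTC)."""
    now = now or datetime.now(timezone.utc)
    ages = [
        (now - datetime.fromisoformat(t.replace("Z", "+00:00"))).total_seconds()
        for t in created_timestamps
    ]
    return median(ages)


def scheduled_job_ages(host="https://openqa.suse.de"):
    """Fetch creation times of currently scheduled jobs (needs network)."""
    import json
    import urllib.request

    with urllib.request.urlopen(f"{host}/api/v1/jobs?state=scheduled") as resp:
        jobs = json.load(resp).get("jobs", [])  # "jobs" key is an assumption
    return [j["t_created"] for j in jobs]


if __name__ == "__main__":
    # The alert value above was ~57675 s, i.e. a median wait of about 16 h.
    print(median_age_seconds(scheduled_job_ages()))
```

With a fixed reference time, `median_age_seconds` is easy to verify offline before pointing it at a live instance.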
- Status changed from New to Feedback
- Assignee set to okurz
- Priority changed from Urgent to High
- Priority changed from High to Low
The threshold is higher now, but we exceed it again. I have paused the alerts again.
The problem is scenarios like sle-15-SP3-Online-aarch64-prj2_host_upgrade_sles12sp5_to_developing_kvm@virt-arm-64bit-ipmi-machine, where each run takes about 5h (!) just to be retriggered again by a seemingly not very efficient approach. This links to something about IPMI instabilities mentioned by the virtualization team and also to the GitHub PR https://github.com/SUSE/qa-automation/pull/741, which does not explain a lot. I see big potential for improvement here, but I wonder what we can do about it except for excluding anything from the "Virtualization" job groups. So far I obviously failed to convince people that automated tests should be about 90% green and only fail under very limited circumstances :D
Setting to "Low" as I don't think we will find a good approach to solve this soon. Likely we will come back to this issue because the alerts stay paused for a long time until something else breaks and we do not realize it because the alert was paused, oh well ...
okurz wrote:
[…] Likely we will come back to this issue as alerts are paused for a long time until something else breaks and we do not realize because the alert was paused, oh well ...
and this just happened in the past days because of network problems causing very slow network transfers. I addressed the topic of long-running virtualization tests with QE PrjM maritawerner and we will bring it up 2020-10-26 in a meeting with QE Virt.
EDIT: 2020-10-26: Attended a meeting with the "QE Virtualization" team to clarify some requirements: "Acceptance" tests should finish within 24h, "Milestone" tests can take longer. We could not come up with a better suggestion than just excluding the Virtualization tests for us.
Other suggestions I have:
- Use simple scenarios running on our default qemu-based workers to test primary requirements and prevent long-running bare-metal test runs from failing on trivial issues, e.g. package conflicts; schedule the longer-running tests after the simple ones
- Use nested-virt, not to verify SLE features directly but to check simple requirements first before going into more complicated scenarios
- Use true text-based consoles to avoid slow VNC connections
- Use autoyast to speed up the installation
- Split test scenarios to reuse installations, similar to how the kernel team does it
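The split-and-reuse suggestion maps onto openQA's existing job-dependency settings (START_AFTER_TEST, PUBLISH_HDD_1, HDD_1, BOOT_HDD_IMAGE are real openQA variables); a minimal sketch, with suite and image names being hypothetical:

```
# Test suite "install_minimal": runs only the (autoyast) installation
# once and publishes the resulting disk image for reuse.
AUTOYAST=autoyast_minimal.xml          # hypothetical profile name
PUBLISH_HDD_1=sle-minimal-installed.qcow2

# Test suite "virt_feature_x": skips the installation entirely by
# booting the published image; scheduled only after the install passed.
START_AFTER_TEST=install_minimal
HDD_1=sle-minimal-installed.qcow2
BOOT_HDD_IMAGE=1
```

This way a trivial failure like a package conflict kills one short qemu job instead of a 5h bare-metal run.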
- Estimated time set to 80142.00 h
- Estimated time deleted (80142.00 h)
- Parent task set to #80142
- Status changed from Feedback to Resolved
Merged, haven't triggered it yet. Ok as a last resort for now.