action #109310: qem-bot/dashboard - mixed old and new incidents size:M - openQA Project (public) - openSUSE Project Management Tool

action #109310

closed

QA (public) - coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

QA (public) - coordination #109641: [epic] qem-bot improvements

qem-bot/dashboard - mixed old and new incidents size:M

Added by osukup about 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Low
Assignee: kraih
Category: Feature requests
Target version: Ready
Start date: 2022-03-31

Description

Observation

Maintenance sometimes re-uses old incidents instead of creating new ones for a package, which leads to mixed results in the dashboard :(

see: https://suse.slack.com/archives/C02D16TCP99/p1648721562205869

So we need a workaround or solution for this corner case.

See also https://github.com/openSUSE/qem-dashboard/issues/61

Originally brought up by coolo in
https://suse.slack.com/archives/C02D16TCP99/p1638283633141300

I just noticed a rather alarming issue: http://dashboard.qam.suse.de/incident/20989 talks about 43 passed, 1 failed jobs for the incident

Problems

http://dashboard.qam.suse.de/incident/20639 references "208 passed, 4 failed, 12 stopped" and links to the openQA results at https://openqa.suse.de/tests/overview?build=%3A20639%3Aopensc but the openQA test results only show 183 passed and 18 soft-failed
- The dashboard should not say "passed" when it means "passed + softfailed"; it should say "ok", see https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Jobs/Constants.pm#L76=
- Consider using time-fixed links, e.g. https://openqa.suse.de/tests/overview?build=%3A20639%3Aopensc&t=2022-04-01+08%3A53%3A19+%2B0000
- Ensure that the results are current and correspond to what openQA itself sees (the numbers should match)
- Exclude any results that are outside a "reasonable time range", e.g. http://dashboard.qam.suse.de/blocked for 20639 shows incident results from some months ago, build 2021…
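
To illustrate the points about "ok" results and time-fixed links, here is a minimal Python sketch, not the dashboard's actual code. The names `OK_RESULTS`, `count_ok` and `overview_url` are hypothetical. The only things taken from this ticket are that "ok" means passed plus softfailed and that the overview page accepts a `t` parameter as in the example link above.

```python
from urllib.parse import urlencode

# Assumption: mirrors the "ok" grouping referenced above (passed + softfailed).
OK_RESULTS = {"passed", "softfailed"}


def count_ok(results):
    """Count job results that should be reported as "ok"."""
    return sum(1 for result in results if result in OK_RESULTS)


def overview_url(build, timestamp, host="https://openqa.suse.de"):
    """Build an overview link pinned to a point in time via the `t` parameter."""
    query = urlencode({"build": build, "t": timestamp})
    return f"{host}/tests/overview?{query}"
```

Calling `overview_url(":20639:opensc", "2022-04-01 08:53:19 +0000")` should reproduce the time-fixed example link, and counting with `count_ok` keeps the label "ok" honest about including soft-failed jobs.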

Acceptance criteria

AC1: It is possible to reuse incidents and qem-bot can still approve related release requests

Suggestions

- Read the qem-dashboard schema to understand where important settings are stored in https://github.com/openSUSE/qem-dashboard/ , in particular https://github.com/openSUSE/qem-dashboard/blob/main/migrations/dashboard.sql
- Read the manual process described under "Workarounds" further down, both as a workaround and for our own understanding
- Just delete all aggregate openQA data in qem-dashboard older than a configurable age, 90 days by default

Workarounds

- Ask maintenance to create a new, fresh incident, e.g. by a comment in IBS
- Detect invalid requests, e.g. ones with outdated results, and reject them
- Manually delete the old data, something along the lines of:

ssh root@qam2.suse.de
machinectl shell postgresql
sudo -u postgres psql dashboard_db
(wreak havoc in here; X below is the maximum age, e.g. '90 days')

SELECT update_settings FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
(store update_settings)

DELETE FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL X;
DELETE FROM update_openqa_settings WHERE id IN (stored update_settings);

Related issues 5 (2 open — 3 closed)

Updated by osukup about 3 years ago

Project changed from QA (public) to openQA Project (public)

Updated by okurz about 3 years ago

Category set to Regressions/Crashes
Target version set to Ready

Updated by okurz about 3 years ago

Description updated (diff)

Updated by osukup about 3 years ago

reported also as https://github.com/openSUSE/qem-dashboard/issues/61

Updated by osukup about 3 years ago

The problem here is caused by using INCIDENT as the common identifier. Unfortunately, it is sometimes reused.

We use INCIDENT because we schedule jobs in the Testing queue with an RRID ( SUSE:Maintenance:INCIDENT:ReviewRequest ) and also in the Staging queue, which has no ReviewRequest ( SUSE:Maintenance:INCIDENT ), and we want tests run in staging to stay valid after the RR is created and the incident is moved to the testing queue.

- The most proper solution would be to create a new identifier
- Simple workaround: automatic data deletion based on the age of the data (results older than roughly a month in aggregates are useless and can cause this problem)
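
A small Python sketch of the two identifier forms mentioned above, only to show why keying results on INCIDENT alone mixes submissions. `IncidentRef` and `parse_ref` are hypothetical names, not qem-bot code, and the review request number used in the note below is made up.

```python
from typing import NamedTuple, Optional


class IncidentRef(NamedTuple):
    incident: int
    review_request: Optional[int]  # None while the incident is still in staging


def parse_ref(ref: str) -> IncidentRef:
    """Parse SUSE:Maintenance:INCIDENT or SUSE:Maintenance:INCIDENT:ReviewRequest."""
    parts = ref.split(":")
    if len(parts) not in (3, 4) or parts[:2] != ["SUSE", "Maintenance"]:
        raise ValueError(f"unexpected identifier: {ref!r}")
    incident = int(parts[2])
    review_request = int(parts[3]) if len(parts) == 4 else None
    return IncidentRef(incident, review_request)
```

Both `parse_ref("SUSE:Maintenance:20639")` and `parse_ref("SUSE:Maintenance:20639:12345")` yield the same `incident`, so anything stored per incident is shared between staging, the first review request and any later resubmission.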

Updated by osukup about 3 years ago

We have a timestamp in the results ... so we can pretty simply create a cron job which cleans up old results :D

Updated by livdywan about 3 years ago

Subject changed from qem-bot/dashboard - mixed old and new incidents to qem-bot/dashboard - mixed old and new incidents size:M
Description updated (diff)
Category changed from Regressions/Crashes to Support
Status changed from New to Workable

Updated by okurz about 3 years ago

Status changed from Workable to In Progress
Assignee set to okurz

Updated by okurz about 3 years ago

Due date set to 2022-04-22
Status changed from In Progress to Feedback

I asked in https://suse.slack.com/archives/C02CCRM8946/p1649327149714419

Hi, I have a question regarding submissions for incidents which kinda "reuse" old incident numbers, see https://progress.opensuse.org/issues/109310 for details about the problem that this brings. Does anyone have objections (and suggestions) if we (automatically) reject submissions for old incidents that existed in before? The effect would be that anyone creating new submissions would need to start new incidents (I hope I got the terms right here)

If there is no objection by 2022-04-22, I suggest we ask openQA test reviewers to reject such submissions and look into automatic rejection.

#10

Updated by okurz about 3 years ago

Subject changed from qem-bot/dashboard - mixed old and new incidents size:M to qem-bot/dashboard - mixed old and new incidents
Description updated (diff)
Due date deleted (~~2022-04-22~~)
Category changed from Support to Feature requests
Status changed from Feedback to New
Assignee deleted (~~okurz~~)
Priority changed from Normal to Low

I asked maintenance experts in https://suse.slack.com/archives/C02CCRM8946/p1649327149714419 whether they are OK with us rejecting such submissions and asking for resubmission as new incidents with a unique number. By stating the "harshest" option first I got some quite helpful answers :)

Hi, I have a question regarding submissions for incidents which kinda "reuse" old incident numbers, see https://progress.opensuse.org/issues/109310 for details about the problem that this brings. Does anyone have objections (and suggestions) if we (automatically) reject submissions for old incidents that existed in before? The effect would be that anyone creating new submissions would need to start new incidents (I hope I got the terms right here)

Marina Latini and Simon Lees explaining when an incident is re-used:

Marina Latini: we are "reusing" old incidents only if really needed. we don't really use old parked incidents randomly. what you call reuse can be an incident with several resubmissions and where we had an initial declined/revoked RR for example.
we have also the case of re-releases of already released incidents and for those it's really wrong to create a new incident.
Simon Lees: the main place I've used them is if we release say SLE-15-SP3 with a regression but haven't released for older codestreams; then generally we will fix the regression in the older codestreams in the original incident rather than creating a new one (obviously for streams that are released we create a new one)

Oliver Kurz and Stephan Kulow: explaining that the impact of the issue is low so far:

Oliver Kurz: ok, I will see what we can do. @Stephan Kulow can you say why are we running into that problem reported now? I don't think that this is a change in maintenance processes. So it's either that qem-bot introduced a regression vs. older tooling or people have ignored the missing support for months/years
Stephan Kulow: As Marina mentioned, it's not done often - and we had this case in the past. But @Jozef Pupava just ignored old results (or they passed and there was nothing to ignore)

My suggestion and the one from Stephan Kulow:

Oliver Kurz: ok, so both you would say that incident_id+release_request_id should be enough to make it unique?
Stephan Kulow: The bot very well knows that the incident is bound to which RR - and if the RR changes, it needs to delete/invalidate the old data

So my open question: How easy would it be to implement that suggestion and where to start?

#11

Updated by coolo about 3 years ago

the dashboard can trigger that cleanup when it gets new swamp data and notices an update of the RR.
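
A rough Python sketch of that idea, using an in-memory dict instead of the real dashboard database. `IncidentState` and `sync_incident` are hypothetical names. It also assumes that results collected in staging (no RR yet) should survive the first RR being created, as osukup described above, and that only a change from one RR to another invalidates them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IncidentState:
    rr_number: Optional[int] = None
    jobs: List[dict] = field(default_factory=list)


def sync_incident(store: Dict[int, IncidentState], incident: int,
                  rr_number: Optional[int]) -> bool:
    """Record the current RR for an incident.

    Drops the stored jobs when a previously known RR is replaced by a
    different value and returns True in that case.
    """
    state = store.setdefault(incident, IncidentState(rr_number=rr_number))
    changed = state.rr_number is not None and state.rr_number != rr_number
    if changed:
        state.jobs.clear()  # old results belong to the previous RR
    state.rr_number = rr_number
    return changed
```

The return value lets the caller log or count invalidations, which may help when checking that the numbers shown match what openQA itself reports.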

#12

Updated by okurz about 3 years ago

Related to action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR added

#13

Updated by okurz about 3 years ago

I see that there was already an attempt which looks like it intended to address the same issue: https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/46/diffs#14610756e1f5900260e8e8ecf7249d18a0fc7a5c_74_76

coolo wrote:

the dashboard can trigger that cleanup when it gets new swamp data and notices an update of the RR.

ok, sounds good.

#14

Updated by okurz about 3 years ago

Parent task set to #109641

#15

Updated by okurz about 3 years ago

Copied to action #109974: qem-bot/dashboard - mixed old and new incidents - potential future ideas added

#16

Updated by okurz about 3 years ago

Subject changed from qem-bot/dashboard - mixed old and new incidents to qem-bot/dashboard - mixed old and new incidents size:M
Description updated (diff)
Status changed from New to Workable

#17

Updated by okurz about 3 years ago

Description updated (diff)

#18

Updated by okurz about 3 years ago

Description updated (diff)

#19

Updated by osukup about 3 years ago

It can be shortened to two DELETE queries:

DELETE FROM update_openqa_settings WHERE id IN (SELECT update_settings FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL '90 days');
DELETE FROM openqa_jobs WHERE update_settings IS NOT NULL AND updated < NOW() - INTERVAL '90 days';
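
The same two queries with a configurable age could be sketched in Python roughly as follows. This is only an illustration: it uses the standard library `sqlite3` module as a stand-in for the PostgreSQL database, assumes `updated` is stored as an ISO 8601 UTC string, and `delete_old_results` is a hypothetical name. Only the table and column names come from the queries above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def delete_old_results(conn, max_age_days=90, now=None):
    """Delete aggregate jobs older than max_age_days together with their settings.

    Returns the number of rows removed from openqa_jobs.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.cursor()
    # Settings first, while the old jobs that reference them still exist.
    cur.execute(
        "DELETE FROM update_openqa_settings WHERE id IN ("
        "SELECT update_settings FROM openqa_jobs "
        "WHERE update_settings IS NOT NULL AND updated < ?)",
        (cutoff,),
    )
    cur.execute(
        "DELETE FROM openqa_jobs "
        "WHERE update_settings IS NOT NULL AND updated < ?",
        (cutoff,),
    )
    deleted = cur.rowcount
    conn.commit()
    return deleted
```

Passing the cutoff as a bound parameter keeps the age configurable without string-building the SQL; a cron job as suggested earlier could call this once a day.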

#20

Updated by osukup about 3 years ago

Status changed from Workable to In Progress
Assignee set to osukup

#21

Updated by kraih about 3 years ago

Opened a PR: https://github.com/openSUSE/qem-dashboard/pull/63

#22

Updated by kraih about 3 years ago

Assignee changed from osukup to kraih

Stealing this ticket from Ondrej to keep an eye on it, since the proposed PR is now being deployed.

#23

Updated by osukup about 3 years ago

kraih wrote:

Stealing this ticket from Ondrej to keep an eye on it, since the proposed PR is now being deployed.

That was one minute before my own action to forward the ticket to you and set it to Feedback. Your solution is beautiful.

#24

Updated by kraih about 3 years ago

Status changed from In Progress to Feedback

#25

Updated by kraih almost 3 years ago

Related to action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M added

#26

Updated by kraih almost 3 years ago

Status changed from Feedback to Resolved

I think this ticket is resolved, but there is more work to be done, so I've made a follow-up ticket with more cleanup requirements that have come up in the meantime: #110409

#27

Updated by jbaier_cz over 2 years ago

Related to action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M added

#28

Updated by jbaier_cz about 1 year ago

Related to action #155206: [qem-bot] re-release update can miss repo and thus not schedule updates added
