action #108869

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

Missing (re-)schedules of SLE maintenance tests size:M

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-03-24
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See https://suse.slack.com/archives/C02D16TCP99/p1648110330160679

From Jozef Pupava

There are updates with no aggregates or record in http://dashboard.qam.suse.de/blocked S:M:23303:267916 S:M:23302:267917 S:M:23311:267930 S:M:23085:267929 ...
http://dashboard.qam.suse.de/incident/23302
http://dashboard.qam.suse.de/incident/23303
http://dashboard.qam.suse.de/incident/23311
http://dashboard.qam.suse.de/incident/23085 (edited)

There is some issue: the bot is not running jobs on (I guess) resubmitted updates?
e.g. samba
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP2&build=%3A23309%3Asamba&groupid=310
What is the staged status?
http://dashboard.qam.suse.de/incident/23309

The chrony update was rejected by @Paolo Stivanin, but looking at the HA test, the update was not added there, so the test didn't fail due to a regression: https://bugzilla.suse.com/show_bug.cgi?id=1194220#c32
https://openqa.suse.de/tests/8379382#settings
Below @Paolo Stivanin mentioned kernel S:M:23280:268126 (edited)

Another case where aggregates are failing because there is an update which was released 2 days ago!
https://openqa.suse.de/tests/8380595#step/zypper_ref/3

Today's run is worthless: it does not contain new updates and is running with already released updates; the repos are deleted.
I guess the same goes for yesterday and maybe even the days before. (edited)
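
The identifiers in the report above follow the pattern S:M:<incident>:<release request>, and the dashboard links use the incident number. A minimal sketch for turning such an identifier into its dashboard link, for anyone checking a list of affected updates by hand. The helper names parse_rr and dashboard_url are hypothetical, and reading the third and fourth fields as incident and RR number is an assumption based on the links and log output in this ticket:

```python
from typing import NamedTuple


class ReleaseRequest(NamedTuple):
    incident: int   # e.g. 23303, as in /incident/23303
    rr_number: int  # e.g. 267916 (assumed to be the release request number)


def parse_rr(ref: str) -> ReleaseRequest:
    """Split an identifier such as 'S:M:23303:267916' into its numeric parts."""
    parts = ref.strip().split(":")
    if len(parts) != 4 or parts[:2] != ["S", "M"]:
        raise ValueError(f"unexpected identifier format: {ref!r}")
    return ReleaseRequest(int(parts[2]), int(parts[3]))


def dashboard_url(ref: str) -> str:
    """Build the incident page URL in the form used by the links in this ticket."""
    return f"http://dashboard.qam.suse.de/incident/{parse_rr(ref).incident}"


if __name__ == "__main__":
    for ref in ["S:M:23303:267916", "S:M:23302:267917"]:
        print(ref, "->", dashboard_url(ref))
```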

Acceptance criteria

  • AC1: It is known what existing workflows require without needing any new features (Existing workflows to schedule incident and aggregate tests are ok again)
  • AC2: Potential new feature requests have been identified and documented in new tickets

Suggestions

  • Could be related to, or a regression from #103701 / https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/46
  • Talk to jpupava, e.g. in the Slack discussion mentioned above
  • Try to find out what is actually broken
  • Try to separate regressions from new feature requests which should go into separate tickets
  • Try to separate "something is missing" cases from "something is failing" cases

Related issues

Related to QA - action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR (Resolved, 2021-12-08)

Copied to QA - action #108944: 5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M (Resolved, 2022-03-24)

History

#1 Updated by okurz 3 months ago

  • Related to action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR added

#2 Updated by cdywan 3 months ago

  • Subject changed from Missing (re-)schedules of SLE maintenance tests to Missing (re-)schedules of SLE maintenance tests size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by jbaier_cz 3 months ago

At least the "missing" part should be solved. It was related to https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50 after all: there was a leftover --dry parameter in the smelt-sync job.

#4 Updated by osukup 3 months ago

  • Status changed from Workable to Resolved
  • Assignee set to osukup

An overlooked --dry parameter was left in the Sync SMELT workflow, so the updated (real) data needed by the rest of the dashboard was not available.

From the logs:

$ count=0 # collapsed multi-line command
++ count=0
++ ./qem-bot/bot-ng.py -c /etc/openqabot --token [MASKED] --debug --dry smelt-sync
++ tee bot_smelt-sync_0.log
INFO: Loaded 195 active incidents

and

  'packages': ['sle-module-containers-release'],
  'project': 'SUSE:Maintenance:23017',
  'rr_number': 266265}]
INFO: Dry run, nothing synced

GitLab job parameters fixed by removing --dry from the BOT_PARAMS variable.
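
One possible guard against a repeat, as a minimal sketch only: check the BOT_PARAMS value and the job log for signs of a dry run before trusting a production sync. The function names are hypothetical and this is not part of qem-bot; the flag name and the log line are taken from the output quoted above.

```python
import shlex

# Flag and log line as they appear in the job output quoted in this ticket.
DRY_FLAG = "--dry"
DRY_RUN_MARKER = "INFO: Dry run, nothing synced"


def has_dry_flag(bot_params: str) -> bool:
    """Return True if the parameter string contains the --dry flag as its own token."""
    return DRY_FLAG in shlex.split(bot_params)


def log_reports_dry_run(log_text: str) -> bool:
    """Return True if any log line is exactly the dry-run marker."""
    return any(line.strip() == DRY_RUN_MARKER for line in log_text.splitlines())


if __name__ == "__main__":
    params = "--debug --dry"
    log = "INFO: Loaded 195 active incidents\nINFO: Dry run, nothing synced\n"
    if has_dry_flag(params) or log_reports_dry_run(log):
        print("WARNING: smelt-sync ran in dry mode, dashboard data was not updated")
```

Such a check could run as a final step of the scheduled job, or against the saved bot_smelt-sync_0.log, so that a dry run in production is flagged instead of passing silently.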

#5 Updated by okurz 3 months ago

Awesome that you could fix it. I think we can still think of an improvement.

#6 Updated by okurz 3 months ago

  • Status changed from Resolved to Feedback

As with other incidents with bigger impact, we should look for at least one improvement on top of the original problem resolution, see https://progress.opensuse.org/projects/qa/wiki/Tools#How-we-work-on-our-backlog . I recommend conducting a "Five Whys" session. Cleanup is also needed to ensure all affected jobs are properly labeled, retriggered with correct parameters, etc.

#7 Updated by osukup 3 months ago

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.

#8 Updated by dzedro 3 months ago

osukup wrote:

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.

By "nobody" do you mean you, jbaier or tools?

#9 Updated by osukup 3 months ago

dzedro wrote:

osukup wrote:

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.

By "nobody" do you mean you, jbaier or tools?

Anybody with access to gitlab.suse.de :D I checked the logs within 5 minutes of starting my work and identified the problem.

#10 Updated by cdywan 3 months ago

  • Copied to action #108944: 5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M added

#11 Updated by okurz 3 months ago

osukup wrote:

dzedro wrote:

osukup wrote:

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.

By "nobody" do you mean you, jbaier or tools?

Anybody with access to gitlab.suse.de :D I checked the logs within 5 minutes of starting my work and identified the problem.

I agree. I am sure we benefit from teaching each other to help with resolving problems much more than finger-pointing :)

#12 Updated by osukup 3 months ago

  • Status changed from Feedback to Resolved

5 Whys conducted on 31.3., follow-up actions coming

#13 Updated by okurz 3 months ago

  • Parent task set to #91646