action #108869

closed

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

Missing (re-)schedules of SLE maintenance tests size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-03-24
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See https://suse.slack.com/archives/C02D16TCP99/p1648110330160679

From Jozef Pupava

There are updates with no aggregates or record in http://dashboard.qam.suse.de/blocked S:M:23303:267916 S:M:23302:267917 S:M:23311:267930 S:M:23085:267929 ...
http://dashboard.qam.suse.de/incident/23302
http://dashboard.qam.suse.de/incident/23303
http://dashboard.qam.suse.de/incident/23311
http://dashboard.qam.suse.de/incident/23085

There is some issue: the bot is not running jobs on (I guess) resubmitted updates?
e.g. samba
https://openqa.suse.de/tests/overview?distri=sle&version=15-SP2&build=%3A23309%3Asamba&groupid=310
What is the staged status?
http://dashboard.qam.suse.de/incident/23309

The chrony update was rejected by @Paolo Stivanin, but looking at the HA test, the update was not added there, so the test did not fail due to a regression https://bugzilla.suse.com/show_bug.cgi?id=1194220#c32
https://openqa.suse.de/tests/8379382#settings
Below, @Paolo Stivanin mentioned kernel S:M:23280:268126

Another case where aggregates are failing because there is an update which was released 2 days ago!
https://openqa.suse.de/tests/8380595#step/zypper_ref/3

Today's run is worthless: it does not contain new updates and is running with released updates; repos are deleted.
I guess the same goes for yesterday and maybe even the days before.

Acceptance criteria

  • AC1: It is known what existing workflows require without needing any new features (Existing workflows to schedule incident and aggregate tests are ok again)
  • AC2: Potential new feature requests have been identified and documented in new tickets

Suggestions

  • Could be related to, or a regression from #103701 / https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/46
  • Talk to jpupava, e.g. in the Slack discussion mentioned above
  • Try to find out what is actually broken
  • Try to separate regressions from new feature requests which should go into separate tickets
  • Try to separate "something is missing" cases from "something is failing" cases

Related issues 2 (0 open, 2 closed)

Related to QA - action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR (Resolved, osukup, 2021-12-08)

Copied to QA - action #108944: 5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M (Resolved, osukup, 2022-03-24)

Actions #1

Updated by okurz over 2 years ago

  • Related to action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR added
Actions #2

Updated by livdywan over 2 years ago

  • Subject changed from Missing (re-)schedules of SLE maintenance tests to Missing (re-)schedules of SLE maintenance tests size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by jbaier_cz over 2 years ago

At least the "missing" part should be solved; it was related to https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50 after all. There was a leftover --dry parameter in the smelt-sync job.

Actions #4

Updated by osukup over 2 years ago

  • Status changed from Workable to Resolved
  • Assignee set to osukup

A leftover --dry parameter was missed in the Sync SMELT workflow, so the updated (real) data needed by the rest of the dashboard was not available.

From the logs:

$ count=0 # collapsed multi-line command
++ count=0
++ ./qem-bot/bot-ng.py -c /etc/openqabot --token [MASKED] --debug --dry smelt-sync
++ tee bot_smelt-sync_0.log
INFO: Loaded 195 active incidents

and


  'packages': ['sle-module-containers-release'],
  'project': 'SUSE:Maintenance:23017',
  'rr_number': 266265}]
INFO: Dry run, nothing synced

GitLab job parameters fixed by removing --dry from the BOT_PARAMS variable.
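
A leftover --dry in the job parameters could in principle be caught before the job starts. The following is only a minimal sketch and not part of bot-ng: the function names `has_dry_run_flag` and `check_production_params` are hypothetical, and it assumes the BOT_PARAMS value is available as a plain shell-style string.

```python
import shlex


def has_dry_run_flag(bot_params: str) -> bool:
    """Return True if the parameter string contains a standalone --dry token.

    shlex.split tokenizes the string the way a POSIX shell would, so a
    longer option that merely starts with "--dry" does not match.
    """
    return "--dry" in shlex.split(bot_params)


def check_production_params(bot_params: str) -> None:
    """Hypothetical guard: refuse to run a real sync job with --dry set."""
    if has_dry_run_flag(bot_params):
        raise ValueError(
            "BOT_PARAMS contains --dry; the sync would not write any data"
        )
```

A scheduled job could call `check_production_params` on its parameter string as a first step and fail loudly instead of finishing "successfully" without syncing anything. Whether such a guard belongs in the pipeline definition or in the bot itself is an open design choice.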

Actions #5

Updated by okurz over 2 years ago

Awesome that you could fix it. I think we can still think of an improvement.

Actions #6

Updated by okurz over 2 years ago

  • Status changed from Resolved to Feedback

As with other incidents with a bigger impact, we should look for at least one improvement on top of the original problem resolution (see https://progress.opensuse.org/projects/qa/wiki/Tools#How-we-work-on-our-backlog ). I recommend conducting a "Five Whys" session. Cleanup is also needed to ensure that all affected jobs are properly labeled, retriggered with correct parameters, etc.

Actions #7

Updated by osukup over 2 years ago

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.
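
Checking the logs for this particular symptom can be sketched as a small scan. This is only an illustration under assumptions: the marker text is taken from the log excerpt quoted in this ticket, the names `DRY_RUN_MARKER` and `log_reports_dry_run` are hypothetical, and fetching the job log from GitLab is left out entirely.

```python
import re

# Marker line as it appears in the log excerpt quoted in this ticket.
DRY_RUN_MARKER = re.compile(r"^INFO: Dry run, nothing synced\s*$", re.MULTILINE)


def log_reports_dry_run(log_text: str) -> bool:
    """Return True if the log text contains the bot's dry-run notice."""
    return DRY_RUN_MARKER.search(log_text) is not None
```

Applied to a saved log such as the `bot_smelt-sync_0.log` written by `tee` in the excerpt, a True result would indicate that the run synced nothing.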

Actions #8

Updated by dzedro over 2 years ago

osukup wrote:

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.

By nobody, do you mean you, jbaier, or tools?

Actions #9

Updated by osukup over 2 years ago

dzedro wrote:

osukup wrote:

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.

By nobody, do you mean you, jbaier, or tools?

Anybody with access to gitlab.suse.de :D I checked the logs within 5 minutes of starting my work and identified the problem.

Actions #10

Updated by livdywan over 2 years ago

  • Copied to action #108944: 5 whys follow-up to Missing (re-)schedules of SLE maintenance tests size:M added
Actions #11

Updated by okurz over 2 years ago

osukup wrote:

dzedro wrote:

osukup wrote:

Probably the biggest delay in identifying the problem was that nobody checked all the related logs in GitLab.

By nobody, do you mean you, jbaier, or tools?

Anybody with access to gitlab.suse.de :D I checked the logs within 5 minutes of starting my work and identified the problem.

I agree. I am sure we benefit from teaching each other to help with resolving problems much more than from finger-pointing :)

Actions #12

Updated by osukup over 2 years ago

  • Status changed from Feedback to Resolved

5 Whys session conducted on 31.3., follow-up actions coming

Actions #13

Updated by okurz over 2 years ago

  • Parent task set to #91646