action #107671

closed

No aggregate maintenance runs scheduled today on osd size:M

Added by mgrifalconi almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: High
Assignee:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

This seems to be a different issue from #106179, since the dashboard is accessible this time.

Link to list aggregate runs of the day:

https://openqa.suse.de/tests/overview?arch=&flavor=&machine=&test=&modules=&module_re=&groupid=366&groupid=308&groupid=232&groupid=165&groupid=280&groupid=218&groupid=108&groupid=54&groupid=405&groupid=412&groupid=411&groupid=369&groupid=352&groupid=353&groupid=357&groupid=355&groupid=354&groupid=358&groupid=370&groupid=348&groupid=349&groupid=351&groupid=356&groupid=375&groupid=376&groupid=397&groupid=414&build=20220228-1#
(This was showing an empty list at that point)

Impact: update approval blocked

Suggestions

  • caused by downtime of http://download.suse.de
  • read suggestions from #105603
  • Some GitLab CI steps fail, but we allow them to fail so that other steps can continue. For example, in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/886067 "sync smelt" fails but is allowed to fail so that "sync incidents" can continue. As a result we receive no alert about the failure, and there is not sufficient retrying. We could split the steps into separate pipelines, make each step fatal, and add a configurable number of retries and a configurable interval between retries for each step in https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/master/.gitlab-ci.yml, e.g. for "sync smelt" retrying long enough to cover the weekly SUSE IT maintenance window, and less for other critical steps
  • For retrying we do not even need to change qem-bot; we could just use a wrapper in the GitLab CI job itself, e.g. https://github.com/okurz/leaky_bucket_error_count
  • Also look into gitlab CI options to either abort a previous pipeline if a new one is triggered or not start new ones as long as old ones are still running
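
To illustrate the retry suggestion above, here is a minimal sketch of a generic wrapper that a CI job could call around a step. It is not the code from qem-bot, bot-ng or leaky_bucket_error_count; the function name run_with_retries and its parameters are hypothetical and only show the idea of a per-step retry count and interval.

```python
import subprocess
import sys
import time


def run_with_retries(command, retries=3, interval=60.0, sleep=time.sleep):
    """Run `command` until it exits with 0 or the retry budget is used up.

    `retries` and `interval` (seconds) are the per-step knobs suggested in
    the ticket; both names are assumptions made for this sketch.
    Returns the exit code of the last attempt, so the CI step stays fatal.
    """
    attempts = retries + 1  # the first try plus the configured retries
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.run(command).returncode
        if returncode == 0:
            return 0
        print(
            f"attempt {attempt}/{attempts} failed with exit code {returncode}",
            file=sys.stderr,
        )
        if attempt < attempts:
            sleep(interval)  # wait before the next attempt
    return returncode
```

A step such as "sync smelt" could then be given many retries with a long interval to ride out a maintenance window, while other steps use a smaller budget. Because the last exit code is returned, a step that keeps failing still fails the job and can trigger an alert.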

Related issues 3 (0 open, 3 closed)

Related to QA (public) - action #106179: No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S (Resolved, osukup, 2022-02-08)
Related to openQA Infrastructure (public) - action #105603: openQABot pipeline failed: "ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)" size:M (Resolved, jbaier_cz, 2021-12-16)
Related to openQA Project (public) - action #108824: Some of the daily aggregate tests are cancelled without a reason size:M (Resolved, okurz, 2022-03-24)

Actions #1

Updated by okurz almost 3 years ago

  • Related to action #106179: No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S added
Actions #2

Updated by osukup almost 3 years ago

  • Status changed from New to Resolved

caused by downtime of http://download.suse.de

Actions #3

Updated by okurz almost 3 years ago

  • Status changed from Resolved to Feedback
  • Assignee set to okurz
  • Priority changed from Immediate to High
  • Target version set to Ready

osukup, thank you for handling this so urgently. Please keep our process best practices in mind, in particular finding at least one additional improvement.

Actions #4

Updated by okurz almost 3 years ago

  • Status changed from Feedback to New
  • Assignee deleted (okurz)

OK, maybe it wasn't clear. I asked a question and was waiting for feedback, hence I assigned the ticket to myself. Because it was never assigned to anyone else, I wanted to avoid assigning it to individual members of the team. You know, don't shoot the messenger :) But I think I can make the ticket clearer by putting it back to "New". Another question is whether all members of our team know about our best practice of finding at least one additional improvement.

Actions #5

Updated by livdywan almost 3 years ago

  • Subject changed from No aggregate maintenance runs scheduled today on osd to No aggregate maintenance runs scheduled today on osd size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz almost 3 years ago

  • Related to action #105603: openQABot pipeline failed: "ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)" size:M added
Actions #7

Updated by okurz almost 3 years ago

  • Description updated (diff)
Actions #8

Updated by jbaier_cz almost 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to jbaier_cz
Actions #9

Updated by jbaier_cz almost 3 years ago

  • Status changed from In Progress to Feedback

The improvement as discussed today in the meeting is implemented inside https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50; please see the more detailed explanation in the MR / commit messages.

Actions #10

Updated by okurz almost 3 years ago

  • Due date set to 2022-03-31
Actions #11

Updated by jbaier_cz almost 3 years ago

The new pipeline system is working. I will watch it for a few days to see how it deals with server errors.

Actions #12

Updated by okurz almost 3 years ago

  • Related to action #108824: Some of the daily aggregate tests are cancelled without a reason size:M added
Actions #13

Updated by okurz almost 3 years ago

Could #108824 be a regression due to this? Could it be that, due to splitting the calls, we now call isos post on the openQA API with the OBSOLETE parameter on matching products, so that jobs are cancelled unexpectedly and never rescheduled?

Actions #14

Updated by okurz almost 3 years ago

So it turned out that #108824 was actually a regression due to the work on this ticket. I suggest to monitor this further and at best resolve next week, still long before the due-date, please :)

Actions #15

Updated by jbaier_cz almost 3 years ago

okurz wrote:

So it turned out that #108824 was actually a regression due to the work on this ticket. I suggest to monitor this further and at best resolve next week, still long before the due-date, please :)

No, not this one. #108869 was a regression due to this (the other bot-ng ticket :) ). I agree, we should monitor further.

Actions #16

Updated by okurz almost 3 years ago

  • Due date deleted (2022-03-31)
  • Status changed from Feedback to Resolved

No further problems observed, so this can be seen as resolved. We also have a five-whys analysis planned for a related topic, so we will think about improvements in this area anyway.
