action #164733

closed

coordination #99303: [saga][epic] Future improvements for SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #155671: [epic] Better handling of SLE maintenance test review

qem-dashboard (and hence qem-bot) see a job as failed even though it's marked as softfailed since > 30 days in openQA size:M

Added by okurz 5 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Start date:
2024-07-31
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2892821#L54 shows

2024-07-31 07:03:25 INFO     Found failed, not-ignored job https://openqa.suse.de/t14779563 for incident 34532

even though in https://openqa.suse.de/tests/14779563 it's visible that the job was "force_result'd" as part of https://openqa.suse.de/tests/14779563#comment-1538601 already on 2024-07-02. The most recent "sync incidents" job https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/2892922 does not mention 14779563.
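To collect such cases from a bot log without reading it by hand, the log lines can be matched mechanically. The following is only a minimal sketch: it assumes the log format is exactly what is quoted in this ticket (the "Found failed, not-ignored job" line above and the "Ignoring failed job" line the bot prints when an openQA comment applies). The names `BotJobLine` and `parse_bot_log` are hypothetical and not part of qem-bot.

```python
import re
from typing import List, NamedTuple


class BotJobLine(NamedTuple):
    """One failed-job line from a bot log (hypothetical helper type)."""
    ignored: bool   # True if the bot ignored the failure due to an openQA comment
    job_url: str
    job_id: int
    incident: int


# Assumption: both message variants follow the wording quoted in this ticket.
_LINE_RE = re.compile(
    r"(?P<kind>Found failed, not-ignored job|Ignoring failed job) "
    r"(?P<url>https?://\S+/t(?P<job>\d+)) for incident (?P<incident>\d+)"
)


def parse_bot_log(text: str) -> List[BotJobLine]:
    """Extract all failed-job lines from a bot log, keeping their order."""
    entries = []
    for line in text.splitlines():
        match = _LINE_RE.search(line)
        if match:
            entries.append(
                BotJobLine(
                    ignored=match["kind"].startswith("Ignoring"),
                    job_url=match["url"],
                    job_id=int(match["job"]),
                    incident=int(match["incident"]),
                )
            )
    return entries


# Sample lines copied from this ticket.
sample = (
    "2024-07-31 07:03:25 INFO     Found failed, not-ignored job "
    "https://openqa.suse.de/t14779563 for incident 34532\n"
    "2024-11-25 15:05:51 INFO     Ignoring failed job "
    "https://openqa.suse.de/t15996419 for incident 36467 due to openQA comment\n"
)

for entry in parse_bot_log(sample):
    print(entry)
```

The job ids of entries with `ignored=False` are the candidates to compare against the current result in openQA.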

Expected result

Suggestions

  • Ask test reviewers about examples
    • Look for jobs that are softfailed via a force_result label and at the same time still failed on the dashboard.
  • Check how "sync incidents" handles already finished results. Maybe finished results are only revisited when an AMQP event for a new comment is received, and that event could have been missed, so the existing result is never revisited?
  • To reproduce, a qem-dashboard database dump is available on qam2.suse.de within the "postgresql" machine in /root/dashboard_db-2024-07-31T09:43:21+02:00.sql.xz
  • Set up dashboard/bot/openQA locally and simulate how the dashboard/bot behave if an already finished job changes its result
  • Look at the qem-bot code to check for any obvious problems with handling softfailed jobs
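The first suggestion, looking for jobs that are softfailed in openQA but still failed on the dashboard, boils down to comparing two result mappings. The sketch below assumes the results were already fetched by some other means (for example from the database dump mentioned above) and shows only the comparison. The function name `find_stale_failures`, the status strings, and the dashboard values in the sample are illustrative assumptions, not actual qem-dashboard data.

```python
from typing import Dict, List


def find_stale_failures(
    openqa_results: Dict[int, str],
    dashboard_results: Dict[int, str],
) -> List[int]:
    """Return ids of jobs that openQA reports as softfailed while the
    dashboard still lists them as failed (hypothetical helper)."""
    return sorted(
        job_id
        for job_id, result in openqa_results.items()
        if result == "softfailed" and dashboard_results.get(job_id) == "failed"
    )


# Illustrative sample: job ids are taken from this ticket, the dashboard
# states are made up to demonstrate the comparison.
openqa = {14779563: "softfailed", 15939391: "softfailed", 15996419: "failed"}
dashboard = {14779563: "failed", 15939391: "passed", 15996419: "failed"}

print(find_stale_failures(openqa, dashboard))  # [14779563]
```

Jobs missing from the dashboard mapping are skipped on purpose, since a job the dashboard does not know about cannot block an approval.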

Related issues 1 (0 open, 1 closed)

Related to QA (public) - action #157204: Sync openQA job removal events to qem-dashboard listening to AMQP events size:M - Resolved - jbaier_cz - 2024-03-14

Actions #1

Updated by okurz 5 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 5 months ago

  • Related to action #157204: Sync openQA job removal events to qem-dashboard listening to AMQP events size:M added
Actions #3

Updated by okurz 5 months ago

  • Parent task set to #155671
Actions #4

Updated by okurz 3 months ago

  • Description updated (diff)
Actions #5

Updated by livdywan 2 months ago

  • Description updated (diff)
Actions #6

Updated by livdywan 2 months ago

  • Subject changed from qem-dashboard (and hence qem-bot) see a job as failed even though it's marked as softfailed since > 30 days in openQA to qem-dashboard (and hence qem-bot) see a job as failed even though it's marked as softfailed since > 30 days in openQA size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #7

Updated by mkittler 2 months ago

  • Target version changed from Tools - Next to Ready
Actions #8

Updated by jbaier_cz 2 months ago

  • Assignee set to jbaier_cz
Actions #9

Updated by jbaier_cz 2 months ago

Let's create a data point here: at this moment (from the latest bot approval pipeline https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/3322469), updates are blocked by 12 failing openQA tests. None of them is soft-failed or ignored with a comment, so everything behaves as expected so far.

Actions #10

Updated by jbaier_cz about 2 months ago

Another data point: right now we have some soft-failed tests (via automatic force_result) like https://openqa.suse.de/tests/15939391; the dashboard correctly shows no red results, and there is no trace of a halted approval due to a test failure in the bot pipeline either.

Actions #11

Updated by jbaier_cz about 1 month ago

  • Status changed from Workable to Resolved

Again, no soft-failed job is blocking a release. I even managed to verify once more that the "acceptable_for" feature is working as intended:

2024-11-25 15:05:51 INFO     Ignoring failed job https://openqa.suse.de/t15996419 for incident 36467 due to openQA comment
...
2024-11-25 15:05:52 INFO     Incidents to approve:
2024-11-25 15:05:52 INFO     * SUSE:Maintenance:36467:353830
2024-11-25 15:05:52 INFO     Accepting review for SUSE:Maintenance:36467:353830

Considering the age of this ticket, I believe we have already improved the workflow in the meantime, so everything is working as expected. Hence I am marking this as resolved unless someone points out new examples with the current code base.
