action #123286

closed

Bot and dashboard reference to wrong data and block update approval size:M

Added by mgrifalconi almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Start date:
2022-12-21
Due date:
% Done:

0%

Estimated time:

Description

Observation

Hello, the dashboard data for 27130:dragonbox is inconsistent.

The link of the red SLE 15 SP4 box on the blocked page points to https://openqa.suse.de/tests/overview?build=%3A27130%3Afixmath&distri=sle&groupid=439 which shows no failures.

The link inside the update request page http://dashboard.qam.suse.de/incident/27130 points to a different set of results at https://openqa.suse.de/tests/overview?build=%3A27130%3Alibmwaw which this time shows a failure.

Bot approval job log:

 2023-01-17 08:05:34 INFO     Found failed, not-ignored job 10166069 for incident 27130

Interestingly enough, I restarted the month-old job https://openqa.suse.de/tests/10166069 and its clone is now green: https://openqa.suse.de/tests/10331221
Still, the bot does not like it and keeps the 'box' red.

Problem

The problem here seems to be that incident 27130 was modified multiple times and references multiple packages, as visible in https://smelt.suse.de/incident/27130/

Acceptance criteria

  • AC1: The dashboard page and all links to openQA tests from dashboard reference the same consistent package(s) or no package at all, i.e. no "dragonbox" in dashboard but then pointing to "libmwaw" in openQA

Suggestions

  • Investigate if this is maybe just a display issue and in that case fix it
  • Ask mgrifalconi to update the ticket according to our ticket templates to help us understand what he really expects because we are not clear about that
  • Reconsider how we test maintenance requests before a release request is created while still supporting the "shift left" endeavour
  • Check if the data in the dashboard database regarding packages is consistent with SMELT (to rule out qem-bot involvement)

Related issues 2 (1 open, 1 closed)

Related to QA (public) - action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR (Resolved, osukup, 2021-12-08)

Related to QA (public) - action #122311: Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver (Feedback, mgrifalconi, 2022-12-21)

Actions #1

Updated by mgrifalconi almost 2 years ago

  • Copied from action #122311: Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver added
Actions #2

Updated by mgrifalconi almost 2 years ago

  • Copied from deleted (action #122311: Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver)
Actions #3

Updated by mgrifalconi almost 2 years ago

  • Related to action #103701: Resubmited incident (ID) with new release request (RR) inherits incident test results from previous RR added
Actions #4

Updated by mgrifalconi almost 2 years ago

  • Related to action #122311: Use live openQA test results instead of inconsistent qem-dashboard database in qem-bot approver added
Actions #5

Updated by mgrifalconi almost 2 years ago

  • Description updated (diff)
Actions #6

Updated by okurz almost 2 years ago

@mgrifalconi did you mean to keep yourself as assignee and the target version "future"?

Actions #7

Updated by mgrifalconi almost 2 years ago

  • Target version deleted (future)

Sorry, I cloned this from another ticket and forgot to clean up some fields.

Actions #8

Updated by okurz almost 2 years ago

  • Target version set to future

you still have that assigned to you though :)

Regardless, for now we should keep that in "future" as long as you consider the priority "Normal". I might be misjudging the impact though. Feel free to update the description accordingly.

Actions #9

Updated by mgrifalconi almost 2 years ago

  • Assignee deleted (mgrifalconi)
  • Priority changed from Normal to High

Let's see it this way:

You have the chance to investigate the issue at the moment, since the update is still stuck and the only reason it is stuck is because of this bug.

It will either be manually approved or stay there forever. It does not look like an urgent update to me, but at some point it should be released.

If I manually approve it today, I fear we will close this ticket with 'cannot reproduce' and the issue will come back again. What if this bug, instead of wrongly blocking an update, does the opposite and wrongly approves it based on old successful test results?

While we are discussing this: why do all components related to maintenance approval (openQA, bot, dashboard) use something that is not unique as a unique ID?
The SMELT incident (first number in S:M:XXXXX:YYYYYY) can be reused for multiple requests on IBS (second number), and the IBS ID is the thing that actually gets approved. Please let me know if this should be the subject of another ticket.

In summary, I propose a high prio, or whatever prio makes you look at it before manually approving the update or blocking it for months :)
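
The point about non-unique IDs can be illustrated with a small Python sketch. It is only an illustration under assumptions: `UpdateId` and `parse_update_id` are hypothetical names, not part of qem-bot or the dashboard, and the second request number below is made up to stand in for an earlier request of the same incident.

```python
from typing import NamedTuple


class UpdateId(NamedTuple):
    incident: int  # SMELT incident number, may be reused
    request: int   # IBS release request number, the thing that gets approved


def parse_update_id(text: str) -> UpdateId:
    """Parse an identifier of the form S:M:<incident>:<request>."""
    parts = text.strip().split(":")
    if len(parts) != 4 or parts[:2] != ["S", "M"]:
        raise ValueError(f"not an S:M:<incident>:<request> identifier: {text!r}")
    return UpdateId(int(parts[2]), int(parts[3]))


# Results keyed by the (incident, request) pair: a reused incident number
# with a new request does not collide with results of the old request.
results = {}
results[parse_update_id("S:M:27130:288026")] = "passed"
results[parse_update_id("S:M:27130:111111")] = "failed"  # hypothetical earlier request

print(len(results))
```

Keyed only by the incident number, the two entries above would overwrite each other, which is the kind of mix-up described in this ticket.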

Actions #10

Updated by dzedro almost 2 years ago

Latest conversation regarding the S:M:27039:290488 gdb update, which is blocked due to a record of deleted failed tests: https://suse.slack.com/archives/C02CANHLANP/p1677160357608009

Actions #11

Updated by okurz almost 2 years ago

  • Target version changed from future to Ready
Actions #12

Updated by mkittler almost 2 years ago

  • Subject changed from Bot and dashboard reference to wrong data and block update approval to Bot and dashboard reference to wrong data and block update approval size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #13

Updated by okurz almost 2 years ago

  • Tags set to qem-dashboard, qem-bot, maintenance
  • Target version deleted (Ready)
Actions #14

Updated by kraih almost 2 years ago

  • Assignee set to kraih
  • Target version set to Ready
Actions #15

Updated by kraih almost 2 years ago

  • Status changed from Workable to In Progress
Actions #16

Updated by kraih almost 2 years ago

As expected, the database shows the correct packages for the incident:

dashboard_db=# select * from incidents where number = 27130;
   id    | number | rr_number |        project         | approved | emu | active |                packages                 | review | review_qam
---------+--------+-----------+------------------------+----------+-----+--------+-----------------------------------------+--------+------------
 6805498 |  27130 |    288026 | SUSE:Maintenance:27130 | f        | f   | f      | {dragonbox,fixmath,libmwaw,libreoffice} | t      | f
(1 row)

But it appears that the build for the incident-specific openQA jobs does not match the link shown in the UI (:27130:libmwaw vs :27130:fixmath):

dashboard_db=# select id, flavor, version, settings::json->'BUILD' as build from incident_openqa_settings where incident = 6805498 order by id desc;
   id    |            flavor            | version |      build
---------+------------------------------+---------+------------------
 1965661 | Server-DVD-Incidents-Install | 15-SP4  | ":27130:fixmath"
 1965659 | Server-DVD-Incidents         | 15-SP4  | ":27130:fixmath"
 1965605 | Leap-DVD-Incidents           | 15.4    | ":27130:fixmath"
 1965600 | Leap-DVD-Incidents           | 15.3    | ":27130:fixmath"
(4 rows)
Actions #17

Updated by kraih almost 2 years ago

It appears the current logic for selecting the build is wrong because it does not consider reused incidents:

dashboard_db=# SELECT build FROM incident_openqa_settings ios JOIN openqa_jobs oj ON oj.incident_settings=ios.id WHERE incident=6805498 LIMIT 1;
     build
----------------
 :27130:libmwaw
(1 row)

If we select the build from the most recently updated job it looks much better:

dashboard_db=# SELECT build FROM incident_openqa_settings ios JOIN openqa_jobs oj ON oj.incident_settings=ios.id WHERE incident=6805498 ORDER BY updated DESC LIMIT 1;
     build
----------------
 :27130:fixmath
(1 row)
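
The difference between the two queries can be sketched in Python. This is only an illustration, not the dashboard's actual code: the rows are a hand-picked subset modelled on the openqa_jobs table, and `latest_build` is a hypothetical helper name.

```python
from datetime import datetime

# Illustrative rows with a subset of the openqa_jobs columns.
jobs = [
    {"job_id": 10166081, "build": ":27130:libmwaw", "updated": datetime(2022, 12, 20, 5, 46, 41)},
    {"job_id": 10308218, "build": ":27130:fixmath", "updated": datetime(2023, 2, 14, 7, 27, 48)},
]


def latest_build(rows):
    """Return the build of the most recently updated job, or None without jobs."""
    if not rows:
        return None
    return max(rows, key=lambda row: row["updated"])["build"]


print(latest_build(jobs))
```

Without an explicit ordering, which row a plain LIMIT 1 returns is up to the database, so a stale build from an earlier use of the incident can be picked.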
Actions #18

Updated by openqa_review almost 2 years ago

  • Due date set to 2023-03-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by kraih almost 2 years ago

And now to dig a bit deeper, the openQA job information in the database appears to be inconsistent:

dashboard_db=# select job_id, status, build, updated from openqa_jobs where incident_settings = 1965661;
  job_id  | status |     build      |            updated
----------+--------+----------------+-------------------------------
 10166081 | passed | :27130:libmwaw | 2022-12-20 05:46:41.911796+01
 10308218 | passed | :27130:fixmath | 2023-02-14 07:27:48.905192+01
(2 rows)

dashboard_db=# select job_id, status, build, updated from openqa_jobs where incident_settings = 1965659;
  job_id  | status |     build      |            updated
----------+--------+----------------+-------------------------------
 10166070 | passed | :27130:libmwaw | 2022-12-20 05:46:42.465077+01
 10166079 | passed | :27130:libmwaw | 2022-12-20 05:46:42.558824+01
 10166068 | passed | :27130:libmwaw | 2022-12-20 05:46:42.435259+01
 10166069 | failed | :27130:libmwaw | 2022-12-20 05:46:42.452106+01
 10166074 | passed | :27130:libmwaw | 2022-12-20 05:46:42.510241+01
 10166076 | passed | :27130:libmwaw | 2022-12-20 05:46:42.53002+01
 10166075 | passed | :27130:libmwaw | 2022-12-20 05:46:42.519336+01
 10166078 | passed | :27130:libmwaw | 2022-12-20 05:46:42.54922+01
 10166072 | passed | :27130:libmwaw | 2022-12-20 05:46:42.487171+01
 10166073 | passed | :27130:libmwaw | 2022-12-20 05:46:42.497866+01
 10166071 | passed | :27130:libmwaw | 2022-12-20 05:46:42.476512+01
 10166077 | passed | :27130:libmwaw | 2022-12-20 05:46:42.539296+01
 10308237 | passed | :27130:fixmath | 2023-02-14 07:27:49.103583+01
 10308238 | passed | :27130:fixmath | 2023-02-14 07:27:49.11906+01
 10308242 | passed | :27130:fixmath | 2023-02-14 07:27:49.158171+01
 10308240 | passed | :27130:fixmath | 2023-02-14 07:27:49.137776+01
 10324405 | passed | :27130:fixmath | 2023-02-14 07:27:49.18504+01
 10308241 | passed | :27130:fixmath | 2023-02-14 07:27:49.148439+01
 10308235 | passed | :27130:fixmath | 2023-02-14 07:27:49.086601+01
 10308230 | passed | :27130:fixmath | 2023-02-14 07:27:49.06869+01
 10308233 | passed | :27130:fixmath | 2023-02-14 07:27:49.077991+01
 10308236 | passed | :27130:fixmath | 2023-02-14 07:27:49.094744+01
 10308243 | passed | :27130:fixmath | 2023-02-14 07:27:49.167052+01
 10308239 | passed | :27130:fixmath | 2023-02-14 07:27:49.12875+01
 10308244 | passed | :27130:fixmath | 2023-02-14 07:27:49.176443+01
(25 rows)

dashboard_db=# select job_id, status, build, updated from openqa_jobs where incident_settings = 1965605;
 job_id | status | build | updated
--------+--------+-------+---------
(0 rows)

dashboard_db=# select job_id, status, build, updated from openqa_jobs where incident_settings = 1965600;
 job_id | status | build | updated
--------+--------+-------+---------
(0 rows)

It looks like the old jobs from the :27130:libmwaw build were not cleaned up before the incident was reused for :27130:fixmath. Unfortunately, the logs on the machine don't go back far enough to check whether the cleanup ran at all. I'll have to make the logs a bit less verbose.
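
For illustration, a cleanup of such stale rows could look roughly like the following sketch. It uses an in-memory SQLite table with a simplified openqa_jobs schema and a hypothetical `remove_stale_jobs` helper. It is an assumption about what a cleanup might do, not the dashboard's actual cleanup code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the dashboard's openqa_jobs table.
conn.execute(
    "CREATE TABLE openqa_jobs ("
    "job_id INTEGER PRIMARY KEY, incident_settings INTEGER, "
    "status TEXT, build TEXT, updated TEXT)"
)
conn.executemany(
    "INSERT INTO openqa_jobs VALUES (?, ?, ?, ?, ?)",
    [
        (10166069, 1965659, "failed", ":27130:libmwaw", "2022-12-20 05:46:42"),
        (10166070, 1965659, "passed", ":27130:libmwaw", "2022-12-20 05:46:42"),
        (10308237, 1965659, "passed", ":27130:fixmath", "2023-02-14 07:27:49"),
    ],
)


def remove_stale_jobs(conn, incident_settings, current_build):
    """Delete jobs of one settings row whose build differs from the current build."""
    cur = conn.execute(
        "DELETE FROM openqa_jobs WHERE incident_settings = ? AND build <> ?",
        (incident_settings, current_build),
    )
    conn.commit()
    return cur.rowcount


removed = remove_stale_jobs(conn, 1965659, ":27130:fixmath")
remaining = conn.execute(
    "SELECT job_id, status, build FROM openqa_jobs ORDER BY job_id"
).fetchall()
print(removed, remaining)
```

With the stale :27130:libmwaw rows gone, the failed job 10166069 would no longer be counted against the incident.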

Actions #21

Updated by kraih almost 2 years ago

  • Status changed from In Progress to Feedback

OK, I think I've done everything that can be done for now. The links are fixed, the database is cleaned up, and logging has been changed so that we will have more data in the future should this happen again.

Actions #22

Updated by okurz almost 2 years ago

Did you make changes in the code to select the build differently? I only saw a change regarding the link to openQA.

Actions #23

Updated by kraih almost 2 years ago

okurz wrote:

Did you make changes in the code to select the build differently? I only saw a change regarding the link to openQA.

Yes, but as far as I know that is only used to create openQA links. https://github.com/openSUSE/qem-dashboard/commit/be859a81a96b980ce56a6e651b502dae53d8d096

Actions #24

Updated by okurz almost 2 years ago

LGTM

@mgrifalconi can you please check if you are happy with the current state as well?

Actions #25

Updated by mgrifalconi almost 2 years ago

Thanks for working on this topic. I can't say if this is enough to entirely fix the issue, since it is a sporadic problem that happens under a set of circumstances we cannot easily replicate.
If similar issues happen again, you could consider regularly flushing the dashboard database and downloading new data instead of trying to keep it in sync (if this helps in any way).

I will keep an eye out for similar issues in the future. For now I would say we can mark it as resolved!

Actions #26

Updated by okurz almost 2 years ago

  • Due date deleted (2023-03-21)
  • Status changed from Feedback to Resolved

thank you

Actions #27

Updated by kraih almost 2 years ago

mgrifalconi wrote:

If similar issues happen again, you could consider regularly flushing the dashboard database and downloading new data instead of trying to keep it in sync (if this helps in any way).

We are more or less already doing that with the incident data, which is completely refreshed in one of the pipelines with data from SMELT. What we currently preserve is openQA job data (incident-specific and aggregates), and I'm not sure if it is possible to flush and recreate that from scratch.

Actions #28

Updated by mgrifalconi almost 2 years ago

Hello, documenting here a couple more issues:

http://dashboard.qam.suse.de/incident/28181 shows a failed incident, but the incident links don't.
The bot log https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458064#L149 says that the failed job is
https://openqa.suse.de/t10689483 which does not exist.

and another:
http://dashboard.qam.suse.de/incident/28144
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458148
2023-03-16 14:33:58 INFO Found failed, not-ignored job https://openqa.suse.de/t10658630 for incident 28144

Also posted on Slack: https://suse.slack.com/archives/C02CANHLANP/p1678977155383529

Actions #29

Updated by livdywan almost 2 years ago

mgrifalconi wrote:

Hello, documenting here a couple more issues:

http://dashboard.qam.suse.de/incident/28181 shows a failed incident, but the incident links don't.
The bot log https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458064#L149 says that the failed job is
https://openqa.suse.de/t10689483 which does not exist.

and another:
http://dashboard.qam.suse.de/incident/28144
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458148
2023-03-16 14:33:58 INFO Found failed, not-ignored job https://openqa.suse.de/t10658630 for incident 28144

Also posted on Slack: https://suse.slack.com/archives/C02CANHLANP/p1678977155383529

Would you mind filing a new ticket, since this one was resolved?

Actions #30

Updated by kraih almost 2 years ago

mgrifalconi wrote:

Hello, documenting here a couple more issues:

...

This looks unrelated to this ticket, so I've created a new one: #126167.
