coordination #126167

[epic][qem-bot] Inconsistent job counts in qem-dashboard size:M

Added by kraih about 1 year ago. Updated 10 months ago.

Status: New
Priority: Normal
Assignee: -
Target version: future
Start date: 2023-03-23
Due date: 2023-04-11 (about 13 months late)
% Done: 67%
Estimated time: (Total: 0.00 h)

Description

Observation

Reported by @mgrifalconi in https://progress.opensuse.org/issues/123286#note-28:

http://dashboard.qam.suse.de/incident/28181 shows a failed incident, but the incident's job links don't.
According to the bot log (https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458064#L149), the failed job is
https://openqa.suse.de/t10689483, which does not exist.

Another example:
http://dashboard.qam.suse.de/incident/28144
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458148
2023-03-16 14:33:58 INFO Found failed, not-ignored job https://openqa.suse.de/t10658630 for incident 28144

Acceptance criteria

  • AC1: The dashboard database and openQA database agree on the data shown

Suggestions

  • The non-existent jobs in the qem-bot logs hint at a problem with the bot here, not the dashboard (a possible bot-side existence check is sketched below)
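
A minimal sketch of what such an existence check on the bot side could look like, assuming the openQA REST API where GET /api/v1/jobs/<id> answers 404 for deleted or unknown jobs (illustrative only, not qem-bot's actual implementation):

# Minimal sketch (not qem-bot's actual code): check whether an openQA job
# still exists by querying the openQA REST API; a 404 means the job was
# deleted or never existed.
import requests

OPENQA_URL = "https://openqa.suse.de"  # instance from the logs above

def openqa_job_exists(job_id: int) -> bool:
    """Return True if openQA still knows the job, False if it is gone."""
    resp = requests.get(f"{OPENQA_URL}/api/v1/jobs/{job_id}", timeout=30)
    if resp.status_code == 404:
        return False
    resp.raise_for_status()
    return True

# Example from the report: the job the bot could not find.
if not openqa_job_exists(10689483):
    print("Job 10689483 not found in openQA")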

Subtasks 3 (1 open, 2 closed)

action #126548: [qem-dashboard] Add an API endpoint to flag openQA jobs as missing in openQA size:M (Resolved, kraih, 2023-03-23 to 2023-04-11)

action #126551: [qem-bot] Flag missing openQA jobs with qem-dashboard API size:M (Resolved, mkittler, 2023-03-23)

action #126554: [qem-dashboard] Show more details about incident specific openQA jobs in dashboard ui (New, kraih, 2023-03-23)

Actions #1

Updated by jbaier_cz about 1 year ago

Also see the related slack conversation: https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1678977155.383529

There is one more use case in which a job might get deleted:

yes, I've deleted the two failed ltp_aio_stress jobs because they've been merged into a single runfile in the new LTP release and failed during env setup. I've cloned the correct ltp_aio_stress job manually instead.

So the non-existent jobs might just have been deleted by users. Maybe we want a simple way to delete them in the dashboard, or we might document that deleting jobs is not a good idea and should be replaced by forcing the result to soft-fail and/or creating an ignore-for-auto-approval comment (feature from #95479)

Actions #2

Updated by kraih about 1 year ago

Let's take a look at what's in the dashboard database:

dashboard_db=# select * from incidents where number = 28181;
   id    | number | rr_number |        project         | approved | emu | active |                                                                   packages                                                                   | review | review_qam
---------+--------+-----------+------------------------+----------+-----+--------+----------------------------------------------------------------------------------------------------------------------------------------------+--------+------------
 7765521 |  28181 |    292112 | SUSE:Maintenance:28181 | f        | f   | t      | {kernel-debug,kernel-default,kernel-docs,kernel-ec2,kernel-obs-build,kernel-obs-qa,kernel-source,kernel-syms,kernel-vanilla,kernel-zfcpdump} | t      | t
(1 row)
dashboard_db=# select id, flavor, version, settings::json->'BUILD' as build from incident_openqa_settings where incident = 7765521 order by id desc;
   id    |                flavor                | version |        build
---------+--------------------------------------+---------+---------------------
 1986102 | Server-DVD-TERADATA-Incidents-Kernel | 12-SP3  | ":28181:kernel-ec2"
 1986101 | Server-DVD-Incidents-TERADATA        | 12-SP3  | ":28181:kernel-ec2"
(2 rows)
dashboard_db=# SELECT oj.id, job_id, status, build, updated FROM incident_openqa_settings ios JOIN openqa_jobs oj ON oj.incident_settings=ios.id WHERE incident=7765521 ORDER BY updated;
    id     |  job_id  | status |       build       |            updated
-----------+----------+--------+-------------------+-------------------------------
 404426218 | 10689482 | failed | :28181:kernel-ec2 | 2023-03-14 11:23:35.780127+01
 404426219 | 10689483 | failed | :28181:kernel-ec2 | 2023-03-14 11:23:35.790424+01
 404426011 | 10689476 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:51.9446+01
 404426213 | 10689477 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.898062+01
 404426214 | 10689478 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.906756+01
 404426215 | 10689479 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.91699+01
 404426216 | 10689480 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.929089+01
 404426217 | 10689481 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.937704+01
 404426220 | 10689484 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.946364+01
 404426221 | 10689485 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.95557+01
 404426222 | 10689486 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.964067+01
 404426223 | 10689487 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.972838+01
 404426224 | 10689488 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.981982+01
 404426225 | 10689489 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.991306+01
 404426226 | 10689490 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.000173+01
 404426227 | 10689491 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.009543+01
 404426228 | 10689492 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.019818+01
 404426229 | 10689493 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.028551+01
 404426230 | 10689494 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.038049+01
 404426231 | 10689495 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.04872+01
 404426232 | 10689496 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.065565+01
 404426233 | 10689497 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.080267+01
 404426234 | 10689498 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.091491+01
 404426235 | 10689499 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.103288+01
 404426236 | 10689500 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.113283+01
 404426237 | 10689501 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.122294+01
 404426238 | 10689502 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.131599+01
 404426239 | 10689503 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.140256+01
 404426240 | 10689504 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.15304+01
 404426241 | 10689505 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.165617+01
 404426242 | 10689506 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.175326+01
 404426243 | 10689507 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.185705+01
 404426244 | 10689508 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.196715+01
 404426245 | 10689509 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.205899+01
 404426246 | 10689510 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.214893+01
 404426247 | 10689511 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.224048+01
 404426248 | 10689512 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.233713+01
 404426249 | 10689513 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.245921+01
 404426250 | 10689514 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.256369+01
 404426251 | 10689515 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.267382+01
 404426252 | 10689516 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.277881+01
 404426253 | 10689517 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.288087+01
 404426254 | 10689518 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.297678+01
 404426255 | 10689519 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.307895+01
 404426256 | 10689520 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.317602+01
 404426257 | 10689521 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.328972+01
 404426258 | 10689522 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.339168+01
 404426259 | 10689523 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.350696+01
 404426260 | 10689524 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.362105+01
 404426261 | 10689525 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.372032+01
 404426262 | 10689526 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.382003+01
 404426263 | 10689527 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.391538+01
 404426264 | 10689528 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.40127+01
 404426265 | 10689529 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.410842+01
 407680891 | 10690141 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.421098+01
(55 rows)
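
For reference, the 55 job ids above could be cross-checked against openQA to see which rows the dashboard still lists but openQA no longer has. A rough sketch, with table and column names taken from the queries above; the connection string and the script itself are illustrative assumptions:

# Cross-check the dashboard's job ids for incident 28181 against openQA.
import psycopg2
import requests

OPENQA_URL = "https://openqa.suse.de"

conn = psycopg2.connect("dbname=dashboard_db")  # assumed local access
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT oj.job_id
        FROM incident_openqa_settings ios
        JOIN openqa_jobs oj ON oj.incident_settings = ios.id
        WHERE ios.incident = %s
        """,
        (7765521,),  # internal id of incident 28181, see the first query above
    )
    job_ids = [row[0] for row in cur.fetchall()]

missing = [
    job_id
    for job_id in job_ids
    if requests.get(f"{OPENQA_URL}/api/v1/jobs/{job_id}", timeout=30).status_code == 404
]
print(f"{len(missing)} of {len(job_ids)} dashboard jobs are gone in openQA: {missing}")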
Actions #3

Updated by kraih about 1 year ago

jbaier_cz wrote:

So the non-existent jobs might just have been deleted by users. Maybe we want a simple way to delete them in the dashboard, or we might document that deleting jobs is not a good idea and should be replaced by forcing the result to soft-fail and/or creating an ignore-for-auto-approval comment (feature from #95479)

That's what it looks like indeed. Should we maybe have an API endpoint in the dashboard like DELETE /api/jobs/<job_id> that the bot calls, since it knows when a job is missing in openQA?

https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458064:

2023-03-16 14:04:42 INFO     Job 10689483 not found in openQA
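
The proposed endpoint does not exist yet; if it did, the bot-side call could be as small as the following sketch (the route, the dashboard URL, and the token header are all hypothetical):

# Hypothetical sketch of the proposed bot-side call; DELETE /api/jobs/<job_id>
# is only a suggestion at this point, and the auth header is an assumption.
import requests

DASHBOARD_URL = "http://dashboard.qam.suse.de"

def report_missing_job(job_id: int, token: str) -> None:
    """Ask the dashboard to drop a job that openQA no longer knows about."""
    resp = requests.delete(
        f"{DASHBOARD_URL}/api/jobs/{job_id}",
        headers={"Authorization": f"Token {token}"},
        timeout=30,
    )
    resp.raise_for_status()

# e.g. right after logging "Job 10689483 not found in openQA":
# report_missing_job(10689483, dashboard_token)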
Actions #4

Updated by jbaier_cz about 1 year ago

I would generally agree; the only issue here is that I am not 100% sure it is OK to delete a missing openQA job without any manual intervention. My example case: an incident has two openQA jobs, one passes and the other one fails. After some period of time, the failing one gets deleted (for example due to retention settings in the job group). Now the incident has only one successful job and will be auto-approved even though the bug indicated by the (now already deleted) job is still there.

Actions #5

Updated by kraih about 1 year ago

jbaier_cz wrote:

I would generally agree; the only issue here is that I am not 100% sure it is OK to delete a missing openQA job without any manual intervention. My example case: an incident has two openQA jobs, one passes and the other one fails. After some period of time, the failing one gets deleted (for example due to retention settings in the job group). Now the incident has only one successful job and will be auto-approved even though the bug indicated by the (now already deleted) job is still there.

I got the impression that from the reviewers' perspective, jobs no longer present in openQA are not considered anyway. If they do matter after all, then we need a whole new dashboard feature here. Perhaps flag missing jobs as such in the database and present them accordingly in the dashboard UI.

Actions #6

Updated by MDoucha about 1 year ago

I recommend flagging the missing jobs in the dashboard: block auto-review but allow manual approval. The dashboard could also collect some info about the missing jobs from the openQA audit log, mainly who deleted the jobs and when. The reviewer should then double-check whether deleting the jobs was appropriate and either reschedule the missing jobs or approve manually.

Deleting jobs should happen very rarely, when we decide to drop some jobs from the schedule because they're obsolete and the jobs in question become broken for a few incidents before the removal gets approved and merged.

Actions #7

Updated by kraih about 1 year ago

  • Assignee set to kraih
Actions #8

Updated by kraih about 1 year ago

  • Tags set to reactive work
Actions #9

Updated by okurz about 1 year ago

  • Target version set to Ready
Actions #10

Updated by mgrifalconi about 1 year ago

The direction we are going in with the openQA review is to minimize manual actions, to reduce mistakes and make the process more efficient, but for this special case I agree to still require one, considering how rarely it happens and the risk of approving something by mistake.

@MDoucha a comment about: "The reviewer should then double check whether deleting the jobs was appropriate and either reschedule the missing jobs or approve manually."

I agree with that statement only if by "reviewer" you mean your squad-internal reviewer, who finds out an RR is blocked by looking at the dashboard and seeing a red box with your squad name.

The "openqa review" should be only a safety net to make sure RR do not rot in the queue when squads fail to do their internal review on time.

Actions #11

Updated by osukup about 1 year ago

  • we really need the ability to force-reschedule jobs --> some element in the UI which force-removes records of already scheduled jobs for incidents, or marks them in the database so they are not served to qem-bot during the incident scheduling run, in order to reschedule the tests
Actions #12

Updated by livdywan about 1 year ago

  • Tracker changed from action to coordination
  • Subject changed from [qem-bot] Inconsistent job counts in qem-dashboard to [epic][qem-bot] Inconsistent job counts in qem-dashboard size:M
  • Description updated (diff)
  • Status changed from New to Blocked
Actions #13

Updated by okurz about 1 year ago

Blocked by what?

Actions #14

Updated by kraih about 1 year ago

okurz wrote:

Blocked by what?

In the estimation meeting I promised to create 3 follow-up tickets that will block this one, and I'm about to start writing them. :)

Actions #15

Updated by kraih about 1 year ago

Blocked by #126548.

Actions #16

Updated by kraih about 1 year ago

Blocked by #126551.

Actions #17

Updated by okurz 10 months ago

  • Status changed from Blocked to New
  • Assignee deleted (kraih)
  • Target version changed from Ready to future

Two subtasks resolved, the third is in the "future" target version
