coordination #126167
open [epic][qem-bot] Inconsistent job counts in qem-dashboard size:M
Description
Observation
Reported by @mgrifalconi in https://progress.opensuse.org/issues/123286#note-28:
http://dashboard.qam.suse.de/incident/28181 shows a failed incident, but the incident's links don't.
The bot says (https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458064#L149) that the failed job is https://openqa.suse.de/t10689483, which does not exist.
and another:
http://dashboard.qam.suse.de/incident/28144
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458148
2023-03-16 14:33:58 INFO Found failed, not-ignored job https://openqa.suse.de/t10658630 for incident 28144
Acceptance criteria
- AC1: The dashboard database and openQA database agree on the data shown
Suggestions
- The non-existent jobs in the qem-bot logs hint at a problem with the bot here, not the dashboard
Updated by jbaier_cz over 1 year ago
Also see the related slack conversation: https://app.slack.com/client/T02863RC2AC/C02CANHLANP/thread/C02CANHLANP-1678977155.383529
There is one more use case in which a deleted job can happen:
yes, I've deleted the two failed ltp_aio_stress jobs because they've been merged into a single runfile in the new LTP release and failed during env setup. I've cloned the correct ltp_aio_stress job manually instead.
So the non-existent jobs might just have been deleted by users. Maybe we want a simple way to delete them in the dashboard, or we might document that deleting jobs is not a good idea and should be replaced by forcing the result to soft-fail and/or creating an "ignore for auto-approval" comment (feature from #95479)
Updated by kraih over 1 year ago
Let's take a look at what's in the dashboard database:
dashboard_db=# select * from incidents where number = 28181;
id | number | rr_number | project | approved | emu | active | packages | review | review_qam
---------+--------+-----------+------------------------+----------+-----+--------+----------------------------------------------------------------------------------------------------------------------------------------------+--------+------------
7765521 | 28181 | 292112 | SUSE:Maintenance:28181 | f | f | t | {kernel-debug,kernel-default,kernel-docs,kernel-ec2,kernel-obs-build,kernel-obs-qa,kernel-source,kernel-syms,kernel-vanilla,kernel-zfcpdump} | t | t
(1 row)
dashboard_db=# select id, flavor, version, settings::json->'BUILD' as build from incident_openqa_settings where incident = 7765521 order by id desc;
id | flavor | version | build
---------+--------------------------------------+---------+---------------------
1986102 | Server-DVD-TERADATA-Incidents-Kernel | 12-SP3 | ":28181:kernel-ec2"
1986101 | Server-DVD-Incidents-TERADATA | 12-SP3 | ":28181:kernel-ec2"
(2 rows)
dashboard_db=# SELECT oj.id, job_id, status, build, updated FROM incident_openqa_settings ios JOIN openqa_jobs oj ON oj.incident_settings=ios.id WHERE incident=7765521 ORDER BY updated;
id | job_id | status | build | updated
-----------+----------+--------+-------------------+-------------------------------
404426218 | 10689482 | failed | :28181:kernel-ec2 | 2023-03-14 11:23:35.780127+01
404426219 | 10689483 | failed | :28181:kernel-ec2 | 2023-03-14 11:23:35.790424+01
404426011 | 10689476 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:51.9446+01
404426213 | 10689477 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.898062+01
404426214 | 10689478 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.906756+01
404426215 | 10689479 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.91699+01
404426216 | 10689480 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.929089+01
404426217 | 10689481 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.937704+01
404426220 | 10689484 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.946364+01
404426221 | 10689485 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.95557+01
404426222 | 10689486 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.964067+01
404426223 | 10689487 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.972838+01
404426224 | 10689488 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.981982+01
404426225 | 10689489 | passed | :28181:kernel-ec2 | 2023-03-17 14:45:59.991306+01
404426226 | 10689490 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.000173+01
404426227 | 10689491 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.009543+01
404426228 | 10689492 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.019818+01
404426229 | 10689493 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.028551+01
404426230 | 10689494 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.038049+01
404426231 | 10689495 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.04872+01
404426232 | 10689496 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.065565+01
404426233 | 10689497 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.080267+01
404426234 | 10689498 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.091491+01
404426235 | 10689499 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.103288+01
404426236 | 10689500 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.113283+01
404426237 | 10689501 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.122294+01
404426238 | 10689502 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.131599+01
404426239 | 10689503 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.140256+01
404426240 | 10689504 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.15304+01
404426241 | 10689505 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.165617+01
404426242 | 10689506 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.175326+01
404426243 | 10689507 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.185705+01
404426244 | 10689508 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.196715+01
404426245 | 10689509 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.205899+01
404426246 | 10689510 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.214893+01
404426247 | 10689511 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.224048+01
404426248 | 10689512 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.233713+01
404426249 | 10689513 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.245921+01
404426250 | 10689514 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.256369+01
404426251 | 10689515 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.267382+01
404426252 | 10689516 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.277881+01
404426253 | 10689517 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.288087+01
404426254 | 10689518 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.297678+01
404426255 | 10689519 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.307895+01
404426256 | 10689520 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.317602+01
404426257 | 10689521 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.328972+01
404426258 | 10689522 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.339168+01
404426259 | 10689523 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.350696+01
404426260 | 10689524 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.362105+01
404426261 | 10689525 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.372032+01
404426262 | 10689526 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.382003+01
404426263 | 10689527 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.391538+01
404426264 | 10689528 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.40127+01
404426265 | 10689529 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.410842+01
407680891 | 10690141 | passed | :28181:kernel-ec2 | 2023-03-17 14:46:00.421098+01
(55 rows)
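For reference, the stale failures above can be isolated with a query like the following against the same schema (a sketch, not actual psql output):
SELECT oj.job_id, oj.status
FROM openqa_jobs oj
JOIN incident_openqa_settings ios ON oj.incident_settings = ios.id
WHERE ios.incident = 7765521 AND oj.status = 'failed';
The matches would include job 10689483, which the bot reports as not found in openQA.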
Updated by kraih over 1 year ago
jbaier_cz wrote:
So the non-existent jobs might just have been deleted by users. Maybe we want a simple way to delete them in the dashboard, or we might document that deleting jobs is not a good idea and should be replaced by forcing the result to soft-fail and/or creating an "ignore for auto-approval" comment (feature from #95479)
That's what it looks like, indeed. Should we maybe have an API endpoint in the dashboard, like DELETE /api/jobs/<job_id>, that the bot calls, since it knows when a job is missing in openQA?
https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1458064:
2023-03-16 14:04:42 INFO Job 10689483 not found in openQA
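A minimal sketch of what such an endpoint could execute against the schema shown above (the endpoint itself is only a proposal at this point, and whether the row should really be deleted rather than flagged is exactly what is discussed below):
-- hypothetical handler body for the proposed DELETE /api/jobs/<job_id>
DELETE FROM openqa_jobs WHERE job_id = 10689483;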
Updated by jbaier_cz over 1 year ago
I would generally agree; the only issue here is that I am not 100% sure that it is OK to delete a missing openQA job without any manual intervention. My example case: an incident has two openQA jobs, one passes and the other one fails. After some period of time, the failing one gets deleted (for example due to retention settings in the job group). Now the incident has only one successful job and will be auto-approved even though the bug indicated by the (now already deleted) job is still there.
Updated by kraih over 1 year ago
jbaier_cz wrote:
I would generally agree; the only issue here is that I am not 100% sure that it is OK to delete a missing openQA job without any manual intervention. My example case: an incident has two openQA jobs, one passes and the other one fails. After some period of time, the failing one gets deleted (for example due to retention settings in the job group). Now the incident has only one successful job and will be auto-approved even though the bug indicated by the (now already deleted) job is still there.
I got the impression that, from the reviewer's perspective, jobs no longer present in openQA are not considered anyway. If they do matter after all, then we need a whole new dashboard feature here. Perhaps flag missing jobs as such in the database and present them accordingly in the dashboard UI.
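A minimal sketch of such a flag, assuming a new column on the existing openqa_jobs table (the column name "obsolete" is hypothetical):
-- mark jobs that vanished from openQA instead of deleting them
ALTER TABLE openqa_jobs ADD COLUMN obsolete boolean NOT NULL DEFAULT false;
-- the bot (or the endpoint above) would then flag a missing job rather than remove it
UPDATE openqa_jobs SET obsolete = true WHERE job_id = 10689483;
The dashboard UI could then render obsolete jobs distinctly instead of silently counting them.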
Updated by MDoucha over 1 year ago
I recommend flagging the missing jobs in the dashboard: block auto-approval but allow manual approval. The dashboard could also collect some info about the missing jobs from the openQA audit log, mainly who deleted the jobs and when. The reviewer should then double-check whether deleting the jobs was appropriate and either reschedule the missing jobs or approve manually.
Deleting jobs should happen very rarely, i.e. when we decide to drop some jobs from the schedule because they're obsolete and the jobs in question remain broken for a few incidents before the removal gets approved and merged.
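A sketch of that audit-log lookup, assuming openQA records job deletions in its audit_events table with the job id inside event_data (the table exists in openQA, but the exact event name and payload format would need to be verified):
-- run against the openQA database, not the dashboard database
SELECT user_id, event_data, t_created
FROM audit_events
WHERE event = 'job_delete' AND event_data LIKE '%10689483%';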
Updated by mgrifalconi over 1 year ago
The direction we are taking with openQA review is to minimize manual actions, in order to reduce mistakes and make the process more efficient. But for this special occasion I agree to still require one, considering how rarely it happens and the risk of approving something by mistake.
@MDoucha, a comment about: "The reviewer should then double-check whether deleting the jobs was appropriate and either reschedule the missing jobs or approve manually."
I agree with that statement only if by "reviewer" you mean your squad's internal reviewer, who finds out an RR is blocked by looking at the dashboard and seeing a red box with the squad's name.
The "openQA review" should only be a safety net to make sure RRs do not rot in the queue when squads fail to do their internal review on time.
Updated by osukup over 1 year ago
- We really need the ability to force-reschedule jobs --> some element in the UI which either removes the records of already scheduled jobs for an incident, or marks them in the database so they are not served to qem-bot in the incident scheduling run, causing the tests to be rescheduled (a sketch follows below)
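A sketch of the "mark them in the database" variant, reusing the hypothetical obsolete flag from above so that the next incident scheduling run no longer sees these rows and schedules the tests again:
-- hide all jobs attached to one incident settings row from the bot
UPDATE openqa_jobs SET obsolete = true WHERE incident_settings = 1986102;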
Updated by livdywan over 1 year ago
- Tracker changed from action to coordination
- Subject changed from [qem-bot] Inconsistent job counts in qem-dashboard to [epic][qem-bot] Inconsistent job counts in qem-dashboard size:M
- Description updated (diff)
- Status changed from New to Blocked
Updated by kraih over 1 year ago
okurz wrote:
Blocked by what?
In the estimation meeting I promised to create three follow-up tickets that will block this one. And I'm about to start writing them. :)
Updated by okurz over 1 year ago
- Status changed from Blocked to New
- Assignee deleted (kraih)
- Target version changed from Ready to future
Two subtasks are resolved; the third has the target version "future".