action #117619: Bot approved update request with failing tests size:M - QA (public) - openSUSE Project Management Tool

action #117619

closed

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #117694: [epic] Stable and reliable qem-bot

Bot approved update request with failing tests size:M

Added by mgrifalconi over 2 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

High

Assignee:

tinita

Target version:

openQA Project (public) - Ready

Description

Observation¶

Incident https://smelt.suse.de/incident/25982/
Request that was approved by sle-qam-openqa: https://build.suse.de/request/show/280720
Bot job: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1166058#L279
INFO: SUSE:Maintenance:25982:280720

Failing test: https://openqa.suse.de/tests/9642631#settings
Dashboard: https://dashboard.qam.suse.de/incident/25982

Context on slack: https://suse.slack.com/archives/C02CANHLANP/p1665043765153419

Acceptance criteria¶

AC1: We know the reason why the bot approved the request and didn't see the test failure

Suggestions¶

Run ./qem-bot/bot-ng.py -c /etc/openqabot --token [MASKED] inc-approve --dry (see https://github.com/openSUSE/qem-bot/#usage for more info)
Look into the dashboard logs on qam2.suse.de with journalctl -u dashboard.service
Note: The journal currently only goes back 3 days (to Oct 3), so it is too late for the incident in question. Consider increasing the journal size as a first step
Consider adding code that runs the bot on only a single incident

Related issues 2 (0 open — 2 closed)

Updated by okurz over 2 years ago

Target version set to Ready

Updated by tinita over 2 years ago

Subject changed from Bot approved update request with failing tests to Bot approved update request with failing tests size:M
Description updated (diff)
Status changed from New to Workable

Updated by tinita over 2 years ago

Description updated (diff)

Updated by tinita over 2 years ago

Description updated (diff)

Updated by tinita over 2 years ago

The journal was volatile on qam2, not yet stored in /var/log/journal.
We just created that directory and called journalctl --flush, and now the logs appear there.

@mgrifalconi unfortunately the logs for the incident you mentioned are already gone. Logs only start at Oct 3, but in the future logs should be kept longer.
Can you inform us of other incidents like that? Thanks!

Updated by jbaier_cz over 2 years ago

Noting down an additional idea: we should cross-check the data in the dashboard to find out whether other incidents are missing results from aggregate testing (maybe only for the particular job group).
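
Such a cross-check could be sketched as a set comparison between the job IDs openQA reports for a job group and the ones stored in the dashboard. This is only a minimal sketch: the function name and the sample lists are hypothetical, and fetching the IDs from openQA or from the dashboard database is left out.

```python
def find_missing_jobs(openqa_job_ids, dashboard_job_ids):
    """Return openQA job IDs that have no record in the dashboard, sorted."""
    return sorted(set(openqa_job_ids) - set(dashboard_job_ids))


# Illustrative sample data only; real IDs would come from openQA and
# from the job_id column of the dashboard's openqa_jobs table.
openqa_ids = [9429052, 9429058, 9642631]
dashboard_ids = [9429052, 9429058]

print(find_missing_jobs(openqa_ids, dashboard_ids))
```

Any ID printed here would be a job that openQA knows about but the dashboard never recorded, which is the situation the bot cannot detect on its own.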

Updated by jbaier_cz over 2 years ago

Related to action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M added

Updated by tinita over 2 years ago

We found something that might be related. Looking at the builds in the dashboard database, they stop at 20220906-1 and start again at 20221003-1.

select id, build, concat('https://openqa.suse.de/tests/', job_id), status from openqa_jobs where build like '202209%' order by build;
    id     |   build    |                concat                | status  
-----------+------------+--------------------------------------+---------
 281542711 | 20220901-1 | https://openqa.suse.de/tests/9429052 | passed
 281542717 | 20220901-1 | https://openqa.suse.de/tests/9429058 | passed
 281543266 | 20220901-1 | https://openqa.suse.de/tests/9429194 | passed
...
 285460574 | 20220906-1 | https://openqa.suse.de/tests/9464999 | passed
 285460533 | 20220906-1 | https://openqa.suse.de/tests/9464953 | passed
 285460204 | 20220906-1 | https://openqa.suse.de/tests/9464896 | passed
 285460513 | 20220906-1 | https://openqa.suse.de/tests/9464932 | passed
 285460129 | 20220906-1 | https://openqa.suse.de/tests/9464817 | passed
 285460160 | 20220906-1 | https://openqa.suse.de/tests/9464850 | passed
(1634 rows)

select id, build, concat('https://openqa.suse.de/tests/', job_id), status from openqa_jobs where build like '202210%' order by build;
    id     |   build    |                concat                | status  
-----------+------------+--------------------------------------+---------
 301562369 | 20221003-1 | https://openqa.suse.de/tests/9650349 | passed
 301562354 | 20221003-1 | https://openqa.suse.de/tests/9650334 | passed
 301562346 | 20221003-1 | https://openqa.suse.de/tests/9650315 | passed
 302810831 | 20221005-1 | https://openqa.suse.de/tests/9667078 | passed
 302808116 | 20221005-1 | https://openqa.suse.de/tests/9665956 | passed
...

(4228 rows)

So that build from the test https://openqa.suse.de/tests/9642631 20221001-1 is not in the dashboard database, and maybe someone has an idea why there is this large gap from Sep 6 to Oct 3?
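
A gap like this could also be detected automatically from the build column. The following is a minimal sketch, assuming builds always start with a YYYYMMDD date as in the query output above; the function names and the 7-day threshold are assumptions, not part of the dashboard.

```python
from datetime import datetime, timedelta


def build_date(build):
    """Parse the date part of a build string such as '20220906-1'."""
    return datetime.strptime(build.split("-")[0], "%Y%m%d").date()


def find_build_gaps(builds, max_gap_days=7):
    """Return (previous, next) date pairs that are more than max_gap_days apart."""
    dates = sorted({build_date(b) for b in builds})
    return [
        (prev, nxt)
        for prev, nxt in zip(dates, dates[1:])
        if nxt - prev > timedelta(days=max_gap_days)
    ]


# A few of the builds from the query results above.
builds = ["20220901-1", "20220906-1", "20221003-1", "20221005-1"]
for prev, nxt in find_build_gaps(builds):
    print(f"no builds between {prev} and {nxt}")
```

With these four builds, only the jump from Sep 6 to Oct 3 exceeds the threshold.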

Updated by okurz over 2 years ago

Related to action #117655: Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M added

#10

Updated by okurz over 2 years ago

Parent task set to #117694

#11

Updated by tinita over 2 years ago

Status changed from Workable to In Progress
Assignee set to tinita

#12

Updated by openqa_review over 2 years ago

Due date set to 2022-10-25

Setting due date based on mean cycle time of SUSE QE Tools

#13

Updated by okurz over 2 years ago

Discussed in weekly unblock 2022-10-12. We agreed that it is a good idea to extend qem-bot to be started in dry-run mode against a single incident for testing purposes. In parallel we have planned to extend the openQA API in #117655.

#14

Updated by kraih over 2 years ago

tinita wrote:

So that build from the test https://openqa.suse.de/tests/9642631 20221001-1 is not in the dashboard database, and maybe someone has an idea why there is this large gap from Sep 6 to Oct 3?

I don't; we'll have to talk to Ondrej about that. In that timeframe a lot of non-aggregate jobs were being updated, though:

select count(id) from openqa_jobs where updated <= '2022-10-03' and updated >= '2022-09-06';
 count
-------
 20085
(1 row)

#15

Updated by kraih over 2 years ago

I've added a log message to the dashboard so we can verify in the journal which jobs the bot submitted. https://github.com/openSUSE/qem-dashboard/commit/68e792075023a01efa921c6cffbe0cb709c8be5b

#16

Updated by kraih over 2 years ago

Looking through the openqa_jobs table, I noticed large gaps in the id column:

select id, build, concat('https://openqa.suse.de/tests/', job_id), status, updated from openqa_jobs where updated <= '2022-10-03' order by id desc;
    id     |                         build                         |                concat                | status  |            updated
-----------+-------------------------------------------------------+--------------------------------------+---------+-------------------------------
...
 293755646 | :26067:kernel-livepatch-SLE15-SP2_Update_23           | https://openqa.suse.de/tests/9583234 | passed  | 2022-09-30 12:47:21.021526+02
 293755645 | :26067:kernel-livepatch-SLE15-SP2_Update_23           | https://openqa.suse.de/tests/9583233 | passed  | 2022-09-30 12:47:21.008596+02
 293755644 | :26067:kernel-livepatch-SLE15-SP2_Update_23           | https://openqa.suse.de/tests/9583232 | passed  | 2022-09-30 12:47:20.999165+02
 293754505 | :26048:kgraft-patch-SLE12-SP5_Update_35               | https://openqa.suse.de/tests/9583119 | passed  | 2022-09-25 10:46:41.028612+02
 293754504 | :26048:kgraft-patch-SLE12-SP5_Update_35               | https://openqa.suse.de/tests/9583118 | passed  | 2022-09-25 10:46:41.015559+02
 293754503 | :26048:kgraft-patch-SLE12-SP5_Update_35               | https://openqa.suse.de/tests/9583117 | passed  | 2022-09-25 10:46:41.006355+02
...

That's more than a thousand ids missing between 293755644 and 293754505. Now my first thought was of course that the cleanup code is still misbehaving, but given that #114694 has decent test coverage, that's rather unlikely now. However, looking at the code that adds jobs to the table, it is an INSERT ... ON CONFLICT ... DO UPDATE ... query. That means it can use up generated ids without ever adding a new record to the table. Perhaps we should start questioning that logic! Maybe it skips adding jobs that should be present.
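
The id-burning effect can be illustrated with a toy model. In PostgreSQL the sequence behind a generated id column is advanced when the INSERT is attempted, so an INSERT ... ON CONFLICT ... DO UPDATE that ends up on the update path still consumes an id. The sketch below is plain Python, not the dashboard's code; the class name, the column names and the sample statuses are made up for illustration.

```python
import itertools


class UpsertTable:
    """Toy model of a table with a generated id and a unique job_id column."""

    def __init__(self):
        self._sequence = itertools.count(1)  # stands in for the id sequence
        self.rows = {}  # job_id -> {"id": ..., "status": ...}

    def upsert(self, job_id, status):
        # The id is drawn from the sequence before the conflict is detected,
        # mirroring how PostgreSQL evaluates the column default first.
        new_id = next(self._sequence)
        if job_id in self.rows:
            # DO UPDATE path: the existing row keeps its id, new_id is discarded.
            self.rows[job_id]["status"] = status
        else:
            self.rows[job_id] = {"id": new_id, "status": status}
        return self.rows[job_id]["id"]


table = UpsertTable()
table.upsert(9583117, "running")  # new row, takes the first id
table.upsert(9583117, "passed")   # update only, the second id is discarded
table.upsert(9583118, "passed")   # new row, takes the third id
print(sorted(row["id"] for row in table.rows.values()))
```

In this model, gaps in the id column appear even though no row was ever deleted, so gaps alone do not prove that records are missing.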

#17

Updated by tinita over 2 years ago

I got venv and finally pytest working.
I had to set

export OSC_PLUGIN_FAIL_IGNORE=1

before installing the requirements.

I got the bot running with this:

./bot-ng.py -c /etc/openqabot --token abc --dry  -c tests/fixtures/config/ inc-approve

and was able to replace the incident list with just a single one manually.
Now I'll try to add an option.
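
Restricting a run to one incident could look roughly like the sketch below. It is not the actual qem-bot implementation: the function name, the incident_id parameter and the shape of the incident records are assumptions made for illustration.

```python
def select_incidents(incidents, incident_id=None):
    """Return all incidents, or only the one matching incident_id if given."""
    if incident_id is None:
        return list(incidents)
    return [inc for inc in incidents if inc["number"] == incident_id]


# Hypothetical incident records; the real bot's data structure may differ.
incidents = [{"number": 25982}, {"number": 26048}, {"number": 26067}]
print(select_incidents(incidents, incident_id=25982))
```

Keeping the filter as a separate step means the approval logic itself runs unchanged, which matters when the goal is to reproduce a wrong approval in dry-run mode.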

#18

Updated by kraih over 2 years ago

I've reopened #114694, since the original problem has resurfaced. This ticket is probably related after all. I will be taking more drastic action now: the cleanup feature in question will be disabled completely.

#19

Updated by tinita over 2 years ago

I added the option (-I / --incident), but I'm still struggling with writing a test.
https://github.com/perlpunk/qem-bot/tree/approve-single-incident

#20

Updated by kraih over 2 years ago

If we are lucky this problem is now also resolved with the removal of the dashboard cleanup feature.

#21

Updated by tinita over 2 years ago

https://github.com/openSUSE/qem-bot/pull/77 - Add option --incident

I had the tests working at noon, but then I had to rebase onto master and mock some more things due to the recent changes.

#22

Updated by tinita over 2 years ago

Status changed from In Progress to Feedback

https://github.com/openSUSE/qem-bot/pull/77 was merged

Waiting for #114694 as it might be the same problem.

#23

Updated by tinita over 2 years ago

Due date changed from 2022-10-25 to 2022-10-28

Setting due date the same as #114694

#24

Updated by okurz over 2 years ago

Due date deleted (2022-10-28)

By now we are sure that the approval had the same cause as #114694. With https://github.com/openSUSE/qem-bot/pull/77 now available in production as an improvement, we can conclude here and leave any further tasks to be handled in #114694.

#25

Updated by okurz over 2 years ago

Status changed from Feedback to Resolved
