action #117619

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #117694: [epic] Stable and reliable qem-bot

Bot approved update request with failing tests size:M

Added by mgrifalconi about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

Incident https://smelt.suse.de/incident/25982/
Request that was approved by sle-qam-openqa: https://build.suse.de/request/show/280720
Bot job: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1166058#L279
INFO: SUSE:Maintenance:25982:280720

Failing test: https://openqa.suse.de/tests/9642631#settings
Dashboard: https://dashboard.qam.suse.de/incident/25982

Context on slack: https://suse.slack.com/archives/C02CANHLANP/p1665043765153419

Acceptance criteria

  • AC1: We know the reason why the bot approved the request and didn't see the test failure

Suggestions

  • Run ./qem-bot/bot-ng.py -c /etc/openqabot --token [MASKED] inc-approve --dry (see https://github.com/openSUSE/qem-bot/#usage for more info)
  • Look into the dashboard logs on qam2.suse.de with journalctl -u dashboard.service
  • Note: The journal currently only goes back 3 days (to Oct 3), so it is too late for the incident in question. Consider increasing the journal size as a first step
  • Consider adding code that only runs the bot on a single incident

Related issues

Related to QA - action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M (Resolved, 2022-07-26)

Related to openQA Project - action #117655: Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M (Resolved, 2022-10-06)

History

#1 Updated by okurz about 2 months ago

  • Target version set to Ready

#2 Updated by tinita about 2 months ago

  • Subject changed from Bot approved update request with failing tests to Bot approved update request with failing tests size:M
  • Description updated (diff)
  • Status changed from New to Workable

#3 Updated by tinita about 2 months ago

  • Description updated (diff)

#4 Updated by tinita about 2 months ago

  • Description updated (diff)

#5 Updated by tinita about 2 months ago

The journal was volatile on qam2, not yet stored in /var/log/journal.
We just created that directory and called journalctl --flush, and now the logs appear there.

mgrifalconi, unfortunately the logs for the incident you mentioned are already gone. Logs only start at Oct 3, but from now on they should be kept longer.
Can you inform us of other incidents like that? Thanks!

#6 Updated by jbaier_cz about 2 months ago

Noting down an additional idea: we should cross-check the data in the dashboard to find out whether other incidents are missing results from aggregate testing (maybe only for the particular job group).
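
Such a cross-check could be sketched roughly as follows. This is only an illustration in Python: the function name and the input shapes are assumptions, and fetching the job lists from openQA and from the dashboard database is left out entirely. The sample values are job ids and builds that appear in this ticket, split into known and unknown purely for demonstration.

```python
from collections import defaultdict


def find_missing_jobs(openqa_jobs, dashboard_job_ids):
    """Group openQA jobs that the dashboard does not know about by build.

    openqa_jobs: iterable of (job_id, build) pairs as reported by openQA.
    dashboard_job_ids: job ids present in the dashboard database.
    Both inputs are hypothetical; how they are fetched is not shown here.
    """
    known = set(dashboard_job_ids)
    missing = defaultdict(list)
    for job_id, build in openqa_jobs:
        if job_id not in known:
            missing[build].append(job_id)
    return {build: sorted(ids) for build, ids in missing.items()}


# Sample values only; which jobs count as "known" is illustrative.
openqa_jobs = [
    (9464999, "20220906-1"),
    (9642631, "20221001-1"),
    (9650349, "20221003-1"),
]
dashboard_job_ids = [9464999, 9650349]
print(find_missing_jobs(openqa_jobs, dashboard_job_ids))
```

Grouping by build makes it easier to see whether whole builds are absent or only single jobs.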

#7 Updated by jbaier_cz about 2 months ago

  • Related to action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M added

#8 Updated by tinita about 2 months ago

We found something that might be related. Looking at the builds in the dashboard database, they stop at 20220906-1 and start again at 20221003-1.

select id, build, concat('https://openqa.suse.de/tests/', job_id), status from openqa_jobs where build like '202209%' order by build;
    id     |   build    |                concat                | status  
-----------+------------+--------------------------------------+---------
 281542711 | 20220901-1 | https://openqa.suse.de/tests/9429052 | passed
 281542717 | 20220901-1 | https://openqa.suse.de/tests/9429058 | passed
 281543266 | 20220901-1 | https://openqa.suse.de/tests/9429194 | passed
...
 285460574 | 20220906-1 | https://openqa.suse.de/tests/9464999 | passed
 285460533 | 20220906-1 | https://openqa.suse.de/tests/9464953 | passed
 285460204 | 20220906-1 | https://openqa.suse.de/tests/9464896 | passed
 285460513 | 20220906-1 | https://openqa.suse.de/tests/9464932 | passed
 285460129 | 20220906-1 | https://openqa.suse.de/tests/9464817 | passed
 285460160 | 20220906-1 | https://openqa.suse.de/tests/9464850 | passed
(1634 rows)

select id, build, concat('https://openqa.suse.de/tests/', job_id), status from openqa_jobs where build like '202210%' order by build;
    id     |   build    |                concat                | status  
-----------+------------+--------------------------------------+---------
 301562369 | 20221003-1 | https://openqa.suse.de/tests/9650349 | passed
 301562354 | 20221003-1 | https://openqa.suse.de/tests/9650334 | passed
 301562346 | 20221003-1 | https://openqa.suse.de/tests/9650315 | passed
 302810831 | 20221005-1 | https://openqa.suse.de/tests/9667078 | passed
 302808116 | 20221005-1 | https://openqa.suse.de/tests/9665956 | passed
...

(4228 rows)

So the build of the test https://openqa.suse.de/tests/9642631, 20221001-1, is not in the dashboard database. Maybe someone has an idea why there is this large gap from Sep 6 to Oct 3?
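
To spot gaps like this without reading through the query output, a small helper along these lines could be used. It is a sketch under assumptions: the function name and the 7-day threshold are made up, and it only understands aggregate build names of the form YYYYMMDD-N, skipping everything else.

```python
import re
from datetime import datetime

# Aggregate builds look like "20220906-1"; anything else is ignored.
AGGREGATE_BUILD = re.compile(r"^(\d{8})-\d+$")


def find_build_gaps(builds, max_gap_days=7):
    """Return (earlier, later, days) for consecutive build dates that are
    more than max_gap_days apart. The threshold is an arbitrary assumption."""
    days = set()
    for build in builds:
        match = AGGREGATE_BUILD.match(build)
        if match:  # skips e.g. ":26067:kernel-livepatch-SLE15-SP2_Update_23"
            days.add(datetime.strptime(match.group(1), "%Y%m%d").date())
    ordered = sorted(days)
    return [
        (earlier, later, (later - earlier).days)
        for earlier, later in zip(ordered, ordered[1:])
        if (later - earlier).days > max_gap_days
    ]


builds = ["20220901-1", "20220906-1", "20221003-1", "20221005-1"]
for earlier, later, days in find_build_gaps(builds):
    print(f"no builds between {earlier} and {later} ({days} days)")
```

With the build names from the two queries above, this reports the single 27-day gap between 2022-09-06 and 2022-10-03.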

#9 Updated by okurz about 2 months ago

  • Related to action #117655: Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M added

#10 Updated by okurz about 2 months ago

  • Parent task set to #117694

#11 Updated by tinita about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita

#12 Updated by openqa_review about 2 months ago

  • Due date set to 2022-10-25

Setting due date based on mean cycle time of SUSE QE Tools

#13 Updated by okurz about 2 months ago

Discussed in weekly unblock 2022-10-12. We agreed that it is a good idea to extend qem-bot to be started in dry-run mode against a single incident for testing purposes. In parallel we have planned to extend the openQA API in #117655.

#14 Updated by kraih about 2 months ago

tinita wrote:

So the build of the test https://openqa.suse.de/tests/9642631, 20221001-1, is not in the dashboard database. Maybe someone has an idea why there is this large gap from Sep 6 to Oct 3?

I don't; we'll have to talk to Ondrej about that. In that timeframe, though, a lot of non-aggregate jobs were being updated:

select count(id) from openqa_jobs where updated <= '2022-10-03' and updated >= '2022-09-06';
 count
-------
 20085
(1 row)

#15 Updated by kraih about 2 months ago

I've added a log message to the dashboard so we can verify in the journal which jobs the bot submitted. https://github.com/openSUSE/qem-dashboard/commit/68e792075023a01efa921c6cffbe0cb709c8be5b

#16 Updated by kraih about 2 months ago

Looking through the openqa_jobs table, I noticed large gaps in the id column:

select id, build, concat('https://openqa.suse.de/tests/', job_id), status, updated from openqa_jobs where updated <= '2022-10-03' order by id desc;
    id     |                         build                         |                concat                | status  |            updated
-----------+-------------------------------------------------------+--------------------------------------+---------+-------------------------------
...
 293755646 | :26067:kernel-livepatch-SLE15-SP2_Update_23           | https://openqa.suse.de/tests/9583234 | passed  | 2022-09-30 12:47:21.021526+02
 293755645 | :26067:kernel-livepatch-SLE15-SP2_Update_23           | https://openqa.suse.de/tests/9583233 | passed  | 2022-09-30 12:47:21.008596+02
 293755644 | :26067:kernel-livepatch-SLE15-SP2_Update_23           | https://openqa.suse.de/tests/9583232 | passed  | 2022-09-30 12:47:20.999165+02
 293754505 | :26048:kgraft-patch-SLE12-SP5_Update_35               | https://openqa.suse.de/tests/9583119 | passed  | 2022-09-25 10:46:41.028612+02
 293754504 | :26048:kgraft-patch-SLE12-SP5_Update_35               | https://openqa.suse.de/tests/9583118 | passed  | 2022-09-25 10:46:41.015559+02
 293754503 | :26048:kgraft-patch-SLE12-SP5_Update_35               | https://openqa.suse.de/tests/9583117 | passed  | 2022-09-25 10:46:41.006355+02
...

That's more than a thousand ids missing between 293755644 and 293754505. Now my first thought was of course that the cleanup code is still misbehaving, but given that #114694 has decent test coverage, that's rather unlikely now. However, looking at the code that adds jobs to the table, it is an INSERT ... ON CONFLICT ... DO UPDATE ... query. That means it can use up generated ids without ever adding a new record to the table. Perhaps we should start questioning that logic! Maybe it skips adding jobs that should be present.
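
To illustrate how an upsert can burn ids, here is a toy model in Python. It is not the dashboard code and does not talk to a database; the class and function names are made up, and it simply assumes PostgreSQL-like behaviour where the sequence value for the id column is drawn before the conflict check.

```python
class ToyJobsTable:
    """Toy model of a table with a sequence-backed id and a unique job_id."""

    def __init__(self):
        self._next_id = 1  # stands in for the database sequence
        self.rows = {}     # job_id -> {"id": ..., "status": ...}

    def upsert(self, job_id, status):
        # The id is drawn up front, so it is consumed even when the
        # statement ends up updating an existing row.
        candidate_id = self._next_id
        self._next_id += 1
        if job_id in self.rows:
            self.rows[job_id]["status"] = status  # DO UPDATE path
        else:
            self.rows[job_id] = {"id": candidate_id, "status": status}


def id_gaps(table):
    """Return (low, high, missing) for every hole in the id column."""
    ids = sorted(row["id"] for row in table.rows.values())
    return [
        (low, high, high - low - 1)
        for low, high in zip(ids, ids[1:])
        if high - low > 1
    ]


table = ToyJobsTable()
table.upsert(9583117, "running")     # new row, gets id 1
for _ in range(3):
    table.upsert(9583117, "passed")  # updates only, but ids 2-4 are used up
table.upsert(9583118, "passed")      # new row, gets id 5
print(id_gaps(table))
```

In this model the hole in the id column appears although no row was ever deleted, so gaps alone do not prove that records went missing.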

#17 Updated by tinita about 2 months ago

I got venv and finally pytest working.
I had to set

export OSC_PLUGIN_FAIL_IGNORE=1 

before installing the requirements.

I got the bot running with this:

./bot-ng.py -c /etc/openqabot --token abc --dry  -c tests/fixtures/config/ inc-approve

and was able to replace the incident list with just a single one manually.
Now I'll try to add an option.

#18 Updated by kraih about 2 months ago

I've reopened #114694, since the original problem has resurfaced. This ticket is probably related after all. And I will be taking more drastic action now: the cleanup feature in question will be disabled completely.

#19 Updated by tinita about 2 months ago

I added the option (-I / --incident), but I'm still struggling with writing a test.
https://github.com/perlpunk/qem-bot/tree/approve-single-incident
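
For readers who do not want to open the branch, the idea can be sketched like this. It is a simplified illustration and not the code from the branch: the parser layout, the integer type of the option, and the shape of the incident records are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical, stripped-down parser; the real bot has more options."""
    parser = argparse.ArgumentParser(prog="bot-sketch")
    parser.add_argument("--dry", action="store_true",
                        help="do not change anything, only report")
    parser.add_argument("-I", "--incident", type=int, default=None,
                        help="only handle the incident with this number")
    return parser


def select_incidents(incidents, only=None):
    """Return all incidents, or just the one whose number matches `only`."""
    if only is None:
        return list(incidents)
    return [inc for inc in incidents if inc["number"] == only]


args = build_parser().parse_args(["--dry", "-I", "25982"])
incidents = [{"number": 25982}, {"number": 26067}]  # made-up records
print(select_incidents(incidents, args.incident))
```

Filtering the list right after it is loaded keeps the rest of the approval logic unchanged, which is what makes a dry run against a single incident cheap to try.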

#20 Updated by kraih about 2 months ago

If we are lucky, this problem is now also resolved by the removal of the dashboard cleanup feature.

#21 Updated by tinita about 2 months ago

https://github.com/openSUSE/qem-bot/pull/77 - Add option --incident

I had the tests working at noon, but then I had to rebase onto master and mock some more things due to the recent changes.

#22 Updated by tinita about 2 months ago

  • Status changed from In Progress to Feedback

https://github.com/openSUSE/qem-bot/pull/77 was merged

Waiting for #114694 as it might be the same problem.

#23 Updated by tinita about 1 month ago

  • Due date changed from 2022-10-25 to 2022-10-28

Setting due date the same as #114694

#24 Updated by okurz about 1 month ago

  • Due date deleted (2022-10-28)

We are sure by now that the approval had the same cause as #114694. With https://github.com/openSUSE/qem-bot/pull/77 now available in production as an improvement, we can conclude here and leave any further work to be handled in #114694.

#25 Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved
