action #117619
coordination #91646 (closed): [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
coordination #117694: [epic] Stable and reliable qem-bot
Bot approved update request with failing tests size:M
Description
Observation
Incident https://smelt.suse.de/incident/25982/
Request that was approved by sle-qam-openqa: https://build.suse.de/request/show/280720
Bot job: https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1166058#L279
INFO: SUSE:Maintenance:25982:280720
Failing test: https://openqa.suse.de/tests/9642631#settings
Dashboard: https://dashboard.qam.suse.de/incident/25982
Context on slack: https://suse.slack.com/archives/C02CANHLANP/p1665043765153419
Acceptance criteria
- AC1: We know the reason why the bot approved the request and didn't see the test failure
Suggestions
- Run ./qem-bot/bot-ng.py -c /etc/openqabot --token [MASKED] inc-approve --dry (see https://github.com/openSUSE/qem-bot/#usage for more info)
- Look into the dashboard logs on qam2.suse.de: journalctl -u dashboard.service
- Note: The journal currently only goes back 3 days (Oct 3), so it is too late for the incident in question. Consider increasing the journal size as a first step
- Consider adding code that only runs the bot on a single incident
Updated by tinita about 2 years ago
- Subject changed from Bot approved update request with failing tests to Bot approved update request with failing tests size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita about 2 years ago
The journal was volatile on qam2, not yet stored in /var/log/journal.
We just created that directory and called journalctl --flush, and now the logs appear there.
@mgrifalconi unfortunately the logs for the incident you mentioned are already gone. Logs only start at Oct 3, but in the future logs should be kept longer.
Can you inform us of other incidents like that? Thanks!
Updated by jbaier_cz about 2 years ago
Noting down an additional idea: we should cross-check the data in the dashboard to find out whether other incidents are missing results from aggregate testing (maybe only for the particular job group).
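A minimal sketch of such a cross-check, assuming the job ids known to openQA and the job ids stored in the dashboard have already been fetched into two lists. The function name and the way the lists are obtained are hypothetical; only the job ids in the example come from this ticket.

```python
def missing_from_dashboard(openqa_job_ids, dashboard_job_ids):
    """Return the openQA job ids that have no record in the dashboard, sorted."""
    return sorted(set(openqa_job_ids) - set(dashboard_job_ids))


# Illustrative input: 9642631 is the failing job from this ticket,
# the other two ids appear in the dashboard query output.
openqa = [9642631, 9650349, 9650334]
dashboard = [9650349, 9650334]
print(missing_from_dashboard(openqa, dashboard))
```

Any id reported here would be a candidate for a result that the bot never saw when deciding on an approval.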
Updated by jbaier_cz about 2 years ago
- Related to action #114694: Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M added
Updated by tinita about 2 years ago
We found something that might be related. When looking for builds in the dashboard database, it stopped at 20220906-1 and started again at 20221003-1.
select id, build, concat('https://openqa.suse.de/tests/', job_id), status from openqa_jobs where build like '202209%' order by build;
id | build | concat | status
-----------+------------+--------------------------------------+---------
281542711 | 20220901-1 | https://openqa.suse.de/tests/9429052 | passed
281542717 | 20220901-1 | https://openqa.suse.de/tests/9429058 | passed
281543266 | 20220901-1 | https://openqa.suse.de/tests/9429194 | passed
...
285460574 | 20220906-1 | https://openqa.suse.de/tests/9464999 | passed
285460533 | 20220906-1 | https://openqa.suse.de/tests/9464953 | passed
285460204 | 20220906-1 | https://openqa.suse.de/tests/9464896 | passed
285460513 | 20220906-1 | https://openqa.suse.de/tests/9464932 | passed
285460129 | 20220906-1 | https://openqa.suse.de/tests/9464817 | passed
285460160 | 20220906-1 | https://openqa.suse.de/tests/9464850 | passed
(1634 rows)
select id, build, concat('https://openqa.suse.de/tests/', job_id), status from openqa_jobs where build like '202210%' order by build;
id | build | concat | status
-----------+------------+--------------------------------------+---------
301562369 | 20221003-1 | https://openqa.suse.de/tests/9650349 | passed
301562354 | 20221003-1 | https://openqa.suse.de/tests/9650334 | passed
301562346 | 20221003-1 | https://openqa.suse.de/tests/9650315 | passed
302810831 | 20221005-1 | https://openqa.suse.de/tests/9667078 | passed
302808116 | 20221005-1 | https://openqa.suse.de/tests/9665956 | passed
...
(4228 rows)
So the build 20221001-1 of the test https://openqa.suse.de/tests/9642631 is not in the dashboard database. Maybe someone has an idea why there is this large gap from Sep 6 to Oct 3?
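A gap like this could also be spotted automatically. The following is a rough sketch, assuming build strings always start with a YYYYMMDD date as in the query output above; the helper names and the threshold are made up for illustration, and the list only contains builds visible in the output.

```python
from datetime import date, datetime


def build_date(build):
    """Parse the date part of a build string such as '20220906-1'."""
    return datetime.strptime(build.split("-")[0], "%Y%m%d").date()


def find_gaps(builds, max_days=1):
    """Return (previous, next) build date pairs that are more than max_days apart."""
    dates = sorted({build_date(b) for b in builds})
    return [(a, b) for a, b in zip(dates, dates[1:]) if (b - a).days > max_days]


builds = ["20220901-1", "20220906-1", "20221003-1", "20221005-1"]
for start, end in find_gaps(builds, max_days=7):
    print(f"no builds between {start} and {end} ({(end - start).days} days)")
```

With a threshold of 7 days, only the jump from 20220906-1 to 20221003-1 is reported.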
Updated by okurz about 2 years ago
- Related to action #117655: Provide API to get job results for a particular incident, similar to what dashboard/qem-bot does size:M added
Updated by tinita about 2 years ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by openqa_review about 2 years ago
- Due date set to 2022-10-25
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 2 years ago
Discussed in weekly unblock 2022-10-12. We agreed that it is a good idea to extend qem-bot to be started in dry-run mode against a single incident for testing purposes. In parallel we have planned to extend the openQA API in #117655.
Updated by kraih about 2 years ago
tinita wrote:
So the build 20221001-1 of the test https://openqa.suse.de/tests/9642631 is not in the dashboard database. Maybe someone has an idea why there is this large gap from Sep 6 to Oct 3?
I don't, we'll have to talk to Ondrej about that. In the timeframe there were a lot of non-aggregate jobs being updated though:
select count(id) from openqa_jobs where updated <= '2022-10-03' and updated >= '2022-09-06';
count
-------
20085
(1 row)
Updated by kraih about 2 years ago
I've added a log message to the dashboard so we can verify in the journal which jobs the bot submitted. https://github.com/openSUSE/qem-dashboard/commit/68e792075023a01efa921c6cffbe0cb709c8be5b
Updated by kraih about 2 years ago
Looking through the openqa_jobs table, I noticed large gaps in the id column:
select id, build, concat('https://openqa.suse.de/tests/', job_id), status, updated from openqa_jobs where updated <= '2022-10-03' order by id desc;
id | build | concat | status | updated
-----------+-------------------------------------------------------+--------------------------------------+---------+-------------------------------
...
293755646 | :26067:kernel-livepatch-SLE15-SP2_Update_23 | https://openqa.suse.de/tests/9583234 | passed | 2022-09-30 12:47:21.021526+02
293755645 | :26067:kernel-livepatch-SLE15-SP2_Update_23 | https://openqa.suse.de/tests/9583233 | passed | 2022-09-30 12:47:21.008596+02
293755644 | :26067:kernel-livepatch-SLE15-SP2_Update_23 | https://openqa.suse.de/tests/9583232 | passed | 2022-09-30 12:47:20.999165+02
293754505 | :26048:kgraft-patch-SLE12-SP5_Update_35 | https://openqa.suse.de/tests/9583119 | passed | 2022-09-25 10:46:41.028612+02
293754504 | :26048:kgraft-patch-SLE12-SP5_Update_35 | https://openqa.suse.de/tests/9583118 | passed | 2022-09-25 10:46:41.015559+02
293754503 | :26048:kgraft-patch-SLE12-SP5_Update_35 | https://openqa.suse.de/tests/9583117 | passed | 2022-09-25 10:46:41.006355+02
...
That's over a thousand ids missing between 293755644 and 293754505. My first thought was of course that the cleanup code is still misbehaving, but given that #114694 has decent test coverage, that is rather unlikely now. However, the code that adds jobs to the table uses an INSERT ... ON CONFLICT ... DO UPDATE ... query. That means it can use up generated ids without ever adding a new record to the table. Perhaps we should start questioning that logic! Maybe it skips adding jobs that should be present.
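To illustrate how an upsert can use up ids without adding rows, here is a toy model in plain Python. It is not the dashboard code and does not talk to a database; the class and its fields are hypothetical. It only mimics the behaviour that the id is drawn from the sequence before the conflict on job_id is detected, and that a drawn value is never handed back.

```python
import itertools


class ToyJobsTable:
    """Toy model of a table with a sequence-backed id and a unique job_id."""

    def __init__(self):
        self._seq = itertools.count(1)  # stands in for the id sequence
        self.rows = {}  # job_id -> (id, status)

    def upsert(self, job_id, status):
        new_id = next(self._seq)  # drawn before the conflict check
        if job_id in self.rows:
            # DO UPDATE branch: the existing id is kept, new_id is lost
            old_id, _ = self.rows[job_id]
            self.rows[job_id] = (old_id, status)
        else:
            self.rows[job_id] = (new_id, status)


table = ToyJobsTable()
table.upsert(100, "running")
table.upsert(100, "running")  # update, id 2 is consumed
table.upsert(100, "passed")   # update, id 3 is consumed
table.upsert(101, "passed")   # new row gets id 4
print(table.rows)
```

In this model, job 100 keeps id 1 and job 101 gets id 4, so ids 2 and 3 never appear in the table. Frequent status updates of existing jobs would therefore be enough to explain gaps in the id column, without any rows having been deleted.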
Updated by tinita about 2 years ago
I got venv and finally pytest working.
I had to set
export OSC_PLUGIN_FAIL_IGNORE=1
before installing the requirements.
I got the bot running with this:
./bot-ng.py -c /etc/openqabot --token abc --dry -c tests/fixtures/config/ inc-approve
and was able to replace the incident list with just a single one manually.
Now I'll try to add an option.
Updated by kraih about 2 years ago
I've reopened #114694, since the original problem has resurfaced. This ticket is probably related after all. I will now take more drastic action: the cleanup feature in question will be disabled completely.
Updated by tinita about 2 years ago
I added the option (-I / --incident), but I'm still struggling with writing a test.
https://github.com/perlpunk/qem-bot/tree/approve-single-incident
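For readers who have not looked at the branch, the idea behind such an option can be sketched roughly as follows. This is not the qem-bot implementation: only the option names -I / --incident, the inc-approve subcommand and the --dry flag come from this ticket, while the parser layout, the helper names and the neighbouring incident numbers are assumptions for illustration.

```python
import argparse


def parse_args(argv):
    """Hypothetical parser layout, not the actual qem-bot argument parser."""
    parser = argparse.ArgumentParser(description="sketch of a single-incident option")
    parser.add_argument("--dry", action="store_true", help="do not approve anything")
    sub = parser.add_subparsers(dest="command", required=True)
    approve = sub.add_parser("inc-approve")
    approve.add_argument("-I", "--incident", type=int, default=None,
                         help="only process this incident")
    return parser.parse_args(argv)


def select_incidents(incidents, incident=None):
    """Return all incidents, or only the requested one if an id was given."""
    if incident is None:
        return list(incidents)
    return [i for i in incidents if i == incident]


args = parse_args(["--dry", "inc-approve", "-I", "25982"])
# 25981 and 25983 are made-up neighbours of the incident from this ticket
print(select_incidents([25981, 25982, 25983], args.incident))
```

Filtering the incident list this way replaces the manual step of editing the list by hand, and combined with --dry it allows reproducing an approval decision for one incident without side effects.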
Updated by kraih about 2 years ago
If we are lucky this problem is now also resolved with the removal of the dashboard cleanup feature.
Updated by tinita about 2 years ago
https://github.com/openSUSE/qem-bot/pull/77 - Add option --incident
I had the tests working at noon, but then I had to rebase to master, and had to mock some more things due to the recent changes.
Updated by tinita about 2 years ago
- Status changed from In Progress to Feedback
https://github.com/openSUSE/qem-bot/pull/77 was merged
Waiting for #114694 as it might be the same problem.
Updated by tinita about 2 years ago
- Due date changed from 2022-10-25 to 2022-10-28
Setting due date the same as #114694
Updated by okurz about 2 years ago
- Due date deleted (2022-10-28)
We are now sure that the approval had the same cause as #114694. With https://github.com/openSUSE/qem-bot/pull/77 available in production as an improvement, we can conclude here and leave any further tasks to #114694.