action #108824
Some of the daily aggregate tests are cancelled without a reason size:M (closed)
Description
Some of the tests triggered by bot-ng are cancelled without any apparent reason, and their logs are missing.
Examples:
https://openqa.suse.de/tests/8379473
https://openqa.suse.de/tests/8379476
I have observed this a few times in the past days, but I thought it was a sporadic error.
Let's use this ticket to collect these kinds of failures.
Acceptance criteria
- AC1: It is clear what the expected scheduling behavior is (e.g. is VERSION supposed to be 5.1 in the job despite VERSION 5.0 being specified when scheduling the product)
Suggestions
- Clarify with the author of the test - this test is very new
- Initial hypothesis: Have these jobs not been scheduled by bot-ng?
- Find a way to distinguish bot jobs better
- Find out why affected jobs are scheduled with 5.0 but show VERSION 5.1 in the settings
- https://gitlab.suse.de/qac/qac-openqa-yaml/-/blob/master/sle-micro/updates.yaml#L170
- Examine the bot-ng logs for scheduling sle-micro; look for both the 5.0 and 5.1 versions (the later one could obsolete the first one)
- Product log for 5.0 scheduling: https://openqa.suse.de/admin/productlog?id=886973
- Product log for 5.1 scheduling: https://openqa.suse.de/admin/productlog?id=886978
- If the bot cancels jobs as expected, comment "Cancelled because of job #foo" on the relevant jobs (a sketch follows this list)
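A minimal sketch of that last suggestion, assuming the bot shells out to openqa-cli and using openQA's jobs/<id>/comments route; the helper name and the obsoleting job id are hypothetical, the target job id is one of the examples from the description:

import subprocess

# Hypothetical helper: after obsoleting a job, leave a breadcrumb comment on it
# so readers of the job page can see what cancelled it and why.
def comment_on_obsoleted_job(host: str, job_id: int, obsoleted_by: int) -> None:
    subprocess.run(
        ["openqa-cli", "api", "--host", host, "-X", "POST",
         f"jobs/{job_id}/comments",
         f"text=Cancelled because of job #{obsoleted_by}"],
        check=True,  # fail loudly if the API call does not succeed
    )

comment_on_obsoleted_job("https://openqa.suse.de", 8379476, 8379999)  # 8379999 is a made-up obsoleting job id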
Updated by ph03nix over 2 years ago
Here is one: https://openqa.suse.de/tests/8381554#dependencies
Updated by okurz over 2 years ago
- Project changed from openQA Tests to openQA Project
- Category set to Regressions/Crashes
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by okurz over 2 years ago
https://openqa.suse.de/tests/8379476 just says
State: cancelled, finished about 7 hours ago (0)
For the sake of completeness, it also says
Cloned as 8381520
https://openqa.suse.de/admin/auditlog?eventid=11929826 tells us that jlausuch manually created the clone. That is not a problem in itself; it is just jlausuch fixing what is unexpected, namely that the job got cancelled.
https://openqa.suse.de/admin/auditlog?eventid=11924952 says that the original job was created by qa-maintenance-automation at 2022-03-24T01:06:24 as part of https://openqa.suse.de/admin/productlog?id=886974, likely by https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/896564
The questions that readers of the original job page likely have are:
- Who or what cancelled the job?
- Why was it cancelled?
- Why could it not have been completed?
Could this be a regression due to https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50 ?
Updated by okurz over 2 years ago
- Related to action #107671: No aggregate maintenance runs scheduled today on osd size:M added
Updated by jbaier_cz over 2 years ago
okurz wrote:
https://openqa.suse.de/tests/8379476 just says
https://openqa.suse.de/admin/auditlog?eventid=11924952 says that the original job was created by qa-maintenance-automation at 2022-03-24T01:06:24 as part of https://openqa.suse.de/admin/productlog?id=886974, likely by https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/896564
As this is an aggregate test, it has to be the "schedule updates" job; according to the timestamps I assume it was https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/896564
The questions that readers of the original job page likely have are:
- Who or what cancelled the job?
- Why was it cancelled?
- Why could it not have been completed?
Could this be a regression due to https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50 ?
Not in this case; there is no evidence for it. I do not see any retry attempt in the last couple of runs.
Some relevant log lines (with the same repohash as https://openqa.suse.de/tests/8379473):
DEBUG: Posting {'openqa': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'DVD-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'qem': {'incidents': ['23269', '23211', '23284', '23272', '23265', '22742', '22557', '23224', '23006', '23299', '23246'], 'settings': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'DVD-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'repohash': 'd208cd28cabadb8da79df71b9588bcc6', 'build': '20220324-1', 'arch': 'x86_64', 'product': 'SLEMICRO50DVD'}, 'api': 'api/update_settings'}
INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=DVD-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299
DEBUG: Posting {'openqa': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'MicroOS-Image-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'qem': {'incidents': ['23269', '23211', '23284', '23272', '23265', '22742', '22557', '23224', '23006', '23299', '23246'], 'settings': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'MicroOS-Image-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'repohash': 'd208cd28cabadb8da79df71b9588bcc6', 'build': '20220324-1', 'arch': 'x86_64', 'product': 'SLEMICRO50IMAGE'}, 'api': 'api/update_settings'}
INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299
What puzzles me a little bit right now: the slem_migration_5.0_to_5.1 job is not mentioned in the bot-ng logs, the VERSION of that test is 5.1, not 5.0, and this test is never queried in any sync job. That makes me wonder whether those jobs were really scheduled by bot-ng.
Updated by jlausuch over 2 years ago
ph03nix wrote:
Here is one: https://openqa.suse.de/tests/8381554#dependencies
This is not applicable to this ticket, as the cancelled children are due to the parent failing in one step: Test died: command 'chown bernhard /dev/ttysclp0 && usermod -a -G tty,dialout,$(stat -c %G /dev/ttysclp0) bernhard' timed out at /usr/lib/os-autoinst/testapi.pm line 950.
Updated by jlausuch over 2 years ago
okurz wrote:
For the sake of completeness, it also says
Cloned as 8381520
https://openqa.suse.de/admin/auditlog?eventid=11929826 tells us that jlausuch manually created the clone. That is not a problem in itself; it is just jlausuch fixing what is unexpected, namely that the job got cancelled.
Yes, I manually retriggered.
Updated by jlausuch over 2 years ago
jbaier_cz wrote:
What puzzles me a little bit right now: the slem_migration_5.0_to_5.1 job is not mentioned in the bot-ng logs, the VERSION of that test is 5.1, not 5.0, and this test is never queried in any sync job. That makes me wonder whether those jobs were really scheduled by bot-ng.
There is an explanation for that.
The version is forced to 5.1 because it's the TARGET version that the migration test expects:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22
and
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L69
This is the job YAML changing the version with +VERSION: https://gitlab.suse.de/qac/qac-openqa-yaml/-/blob/master/sle-micro/updates.yaml#L170
This way, the job gets scheduled on product version 5.0 with the right incidents for 5.0 and boots the 5.0 HDD.
If I move this test to the 5.1 product, the bot will schedule the incidents for 5.1, but I will still be booting the 5.0 HDD, which is the wrong approach.
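For illustration, a minimal Python sketch of the merge semantics that would explain this behavior; this is an assumption drawn from the comment above, not openQA's actual implementation:

# Assumed semantics: a '+'-prefixed key in the job template settings overrides
# the parameter posted when scheduling the product; a plain key only fills in
# values that were not posted at all.
def merge_settings(posted: dict, template: dict) -> dict:
    merged = dict(posted)                  # e.g. {'VERSION': '5.0', ...}
    for key, value in template.items():
        if key.startswith('+'):
            merged[key[1:]] = value        # '+VERSION: 5.1' replaces VERSION=5.0
        else:
            merged.setdefault(key, value)
    return merged

print(merge_settings({'VERSION': '5.0'}, {'+VERSION': '5.1'}))  # {'VERSION': '5.1'}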
Updated by livdywan over 2 years ago
- Subject changed from Some of the daily aggregate tests are cancelled without a reason to Some of the daily aggregate tests are cancelled without a reason size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jbaier_cz over 2 years ago
jlausuch wrote:
jbaier_cz wrote:
What puzzles me a little bit right now: the slem_migration_5.0_to_5.1 job is not mentioned in the bot-ng logs, the VERSION of that test is 5.1, not 5.0, and this test is never queried in any sync job. That makes me wonder whether those jobs were really scheduled by bot-ng.
There is an explanation for that.
The version is forced to 5.1 because it's the TARGET version that the migration test expects:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22
and
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L69
This is the job YAML changing the version with +VERSION: https://gitlab.suse.de/qac/qac-openqa-yaml/-/blob/master/sle-micro/updates.yaml#L170
This way, the job gets scheduled on product version 5.0 with the right incidents for 5.0 and boots the 5.0 HDD.
If I move this test to the 5.1 product, the bot will schedule the incidents for 5.1, but I will still be booting the 5.0 HDD, which is the wrong approach.
Ok, so the job created via
INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299
is forced to VERSION=5.1, but that would mean that the job created a second later (a few lines later)
INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=b1df128921fdd5cfba2d5a44cea087a7 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.1 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=21371,22557,22623,22742,23006,23007,23008,23211,23216,23224,23246,23253,23264,23267,23277,23284,23291,23299
which has the same DISTRI+FLAVOR+VERSION+ARCH combination, will obsolete the first one, right?
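A minimal Python sketch of that hypothesis; the key function and its fields are inferred from the log lines above and are not bot-ng or openQA code:

# Hedged assumption: with _OBSOLETE=1, a newly posted scheduling product cancels
# pending jobs that share the same DISTRI+VERSION+FLAVOR+ARCH key.
def product_key(settings: dict) -> tuple:
    return (settings['DISTRI'], settings['VERSION'],
            settings['FLAVOR'], settings['ARCH'])

first = {'DISTRI': 'sle-micro', 'VERSION': '5.1',    # posted as 5.0, forced to 5.1 via +VERSION
         'FLAVOR': 'MicroOS-Image-Updates', 'ARCH': 'x86_64'}
second = {'DISTRI': 'sle-micro', 'VERSION': '5.1',   # the genuine 5.1 scheduling run
          'FLAVOR': 'MicroOS-Image-Updates', 'ARCH': 'x86_64'}

assert product_key(first) == product_key(second)  # same product, so the first run gets obsoleted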
Updated by jlausuch over 2 years ago
jbaier_cz wrote:
Ok, so the job created via
INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299
is forced to VERSION=5.1, but that would mean that the job created a second later (a few lines later)
INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=b1df128921fdd5cfba2d5a44cea087a7 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.1 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=21371,22557,22623,22742,23006,23007,23008,23211,23216,23224,23246,23253,23264,23267,23277,23284,23291,23299
which has the same DISTRI+FLAVOR+VERSION+ARCH combination, will obsolete the first one, right?
I'm really not sure how that +VARIABLE syntax works, whether it just replaces the variable at runtime or re-triggers the job with the updated variable...
If this repeats every day for SLE Micro, then this could be the reason, but I've seen this in the past with other types of tests which don't replace any variable. I will try to monitor jobs during the next days to see if I catch this issue again.
Updated by okurz over 2 years ago
- Status changed from Workable to Feedback
- Assignee set to okurz
- Priority changed from Urgent to High
Ok, so this really looks like a very specific problem, limited in impact. So I am taking the ticket and lowering the priority.
@jlausuch I don't plan any changes right now on the tooling side; I am basically waiting for your further results and to see whether you can fix it from the test side.
Updated by jlausuch over 2 years ago
okurz wrote:
Ok, so this really looks like a very specific problem, limited in impact. So I am taking the ticket and lowering the priority.
@jlausuch I don't plan any changes right now on the tooling side; I am basically waiting for your further results and to see whether you can fix it from the test side.
There is nothing to fix from the test side.
Ok, I'll post more links when I see this failure.
Updated by okurz over 2 years ago
- Category changed from Regressions/Crashes to Support
- Status changed from Feedback to Resolved
@jlausuch I understood from your last comment that no further work is necessary, so I am resolving the ticket despite no changes being made in test code (as you mentioned).
Regarding what we stated as acceptance criteria in the ticket description:
Acceptance criteria
- AC1: It is clear what the expected scheduling behavior is (e.g. is VERSION supposed to be 5.1 in the job despite VERSION 5.0 being specified when scheduling the product)
It is now clear that we do not have an unexpected regression but a corner case: openQA media are defined by the VERSION, and that version can be overridden by corresponding test settings, so tests which might appear to belong to separate products (including the version) turn out to be the same product from openQA's point of view. And as https://gitlab.suse.de/qa-maintenance/bot-ng is instructed to schedule new tests with the variable _OBSOLETE=1, all former jobs of the same product are obsoleted, i.e. cancelled. A kind of workaround would of course be to not apply the _OBSOLETE setting here; with older jobs merely deprioritized instead, they would still be kept scheduled or running. A similar approach was suggested for aggregate tests in general within https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73
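As a rough illustration of that workaround, a Python sketch of a schedule call that deprioritizes instead of obsoleting; the helper is hypothetical and whether bot-ng exposes such an option is not confirmed here, but _DEPRIORITIZEBUILD is the openQA scheduling parameter meant for this:

import subprocess

# Hypothetical variant of the bot's "isos post": drop _OBSOLETE so older jobs of
# the same product are not cancelled, and ask openQA to deprioritize them instead.
def schedule_without_obsoletion(host: str, settings: dict) -> None:
    args = [f"{key}={value}" for key, value in settings.items()
            if key != "_OBSOLETE"]          # do not cancel the previous jobs
    args.append("_DEPRIORITIZEBUILD=1")     # lower their priority instead
    subprocess.run(["openqa-cli", "api", "--host", host, "-X", "post", "isos", *args],
                   check=True)

schedule_without_obsoletion("https://openqa.suse.de",
                            {"DISTRI": "sle-micro", "VERSION": "5.0", "BUILD": "20220324-1",
                             "FLAVOR": "MicroOS-Image-Updates", "ARCH": "x86_64", "_OBSOLETE": 1})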
Updated by jlausuch over 2 years ago
okurz wrote:
@jlausuch I understood from your last comment that no further work is necessary, so I am resolving the ticket despite no changes being made in test code (as you mentioned).
Regarding what we stated as acceptance criteria in the ticket description:
Acceptance criteria
- AC1: It is clear what the expected scheduling behavior is (e.g. is VERSION supposed to be 5.1 in the job despite VERSION 5.0 being specified when scheduling the product)
It is now clear that we do not have an unexpected regression but a corner case: openQA media are defined by the VERSION, and that version can be overridden by corresponding test settings, so tests which might appear to belong to separate products (including the version) turn out to be the same product from openQA's point of view. And as https://gitlab.suse.de/qa-maintenance/bot-ng is instructed to schedule new tests with the variable _OBSOLETE=1, all former jobs of the same product are obsoleted, i.e. cancelled. A kind of workaround would of course be to not apply the _OBSOLETE setting here; with older jobs merely deprioritized instead, they would still be kept scheduled or running. A similar approach was suggested for aggregate tests in general within https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73
Yes, I guess that's the issue. As I'm overriding VERSION, the bot obsoletes those jobs. It happens every day, btw.
How can I stop applying the _OBSOLETE setting in https://gitlab.suse.de/qa-maintenance/metadata/-/blob/master/bot-ng/micro50img.yml ?
Updated by okurz over 2 years ago
jlausuch wrote:
Yes, I guess that's the issue. As I'm overriding VERSION, the bot obsoletes those jobs. It happens every day, btw.
How can I stop applying the _OBSOLETE setting in https://gitlab.suse.de/qa-maintenance/metadata/-/blob/master/bot-ng/micro50img.yml ?
The best way would be to not change the VERSION variable at all, so that it does not overlap with a different product. Why do you actually need to change that version instead of setting TARGET_VERSION, which is read in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22 ?
Updated by jlausuch over 2 years ago
okurz wrote:
The best way would be to not change the VERSION variable at all, so that it does not overlap with a different product. Why do you actually need to change that version instead of setting TARGET_VERSION, which is read in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22 ?
In fact, I implemented this variable today to avoid that problem :)
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14641
Updated by okurz over 2 years ago
Awesome! So I actually looked at your solution already :D