action #108824

Some of the daily aggregate tests are cancelled without a reason size:M

Added by jlausuch 3 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Support
Target version:
Start date:
2022-03-24
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Some of the tests that are triggered by bot-ng are cancelled without any apparent reason and with missing logs.

Examples:
https://openqa.suse.de/tests/8379473
https://openqa.suse.de/tests/8379476

I have observed this a few times in the past days, but I thought it was a sporadic error.
Let's use this ticket to collect these kinds of failures.

Acceptance criteria

  • AC1: It is clear what the expected scheduling behavior is (e.g. is VERSION supposed to be 5.1 in the job despite VERSION 5.0 being specified when scheduling the product)

Suggestions


Related issues

Related to QA - action #107671: No aggregate maintenance runs scheduled today on osd size:M (Resolved)

History

#3 Updated by okurz 3 months ago

  • Project changed from openQA Tests to openQA Project
  • Category set to Concrete Bugs
  • Priority changed from Normal to Urgent
  • Target version set to Ready

#4 Updated by okurz 3 months ago

https://openqa.suse.de/tests/8379476 just says

State: cancelled, finished about 7 hours ago (0)

For the sake of completeness, it also says

Cloned as 8381520

https://openqa.suse.de/admin/auditlog?eventid=11929826 tells us that it was jlausuch manually creating the clone. But that's not a problem; that is just jlausuch fixing the unexpected cancellation of the job.

https://openqa.suse.de/admin/auditlog?eventid=11924952 says that the original job was created by qa-maintenance-automation at 2022-03-24T01:06:24 as part of https://openqa.suse.de/admin/productlog?id=886974 , likely by https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/896564

The questions that readers of the original job page likely have are:

  1. Who or what cancelled the job?
  2. Why was it cancelled?
  3. Why could it not have been completed?

Could this be a regression due to https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50 ?

#5 Updated by okurz 3 months ago

  • Related to action #107671: No aggregate maintenance runs scheduled today on osd size:M added

#6 Updated by jbaier_cz 3 months ago

okurz wrote:

https://openqa.suse.de/tests/8379476 just says

https://openqa.suse.de/admin/auditlog?eventid=11924952 says that the original job was created by qa-maintenance-automation at 2022-03-24T01:06:24 as part of https://openqa.suse.de/admin/productlog?id=886974 , likely by https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/896564

As this is an aggregate test, it has to be the "schedule updates" job; according to the timestamps I assume https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/896564

The questions that readers of the original job page likely have are:

  1. Who or what cancelled the job?
  2. Why was it cancelled?
  3. Why could it not have been completed?

Could this be a regression due to https://gitlab.suse.de/qa-maintenance/bot-ng/-/merge_requests/50 ?

Not in this case; there is no evidence for it. I do not see any retry attempt in the last couple of runs.

Some relevant log lines (with the same repohash as https://openqa.suse.de/tests/8379473):

DEBUG: Posting {'openqa': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'DVD-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'qem': {'incidents': ['23269', '23211', '23284', '23272', '23265', '22742', '22557', '23224', '23006', '23299', '23246'], 'settings': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'DVD-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'repohash': 'd208cd28cabadb8da79df71b9588bcc6', 'build': '20220324-1', 'arch': 'x86_64', 'product': 'SLEMICRO50DVD'}, 'api': 'api/update_settings'}

INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=DVD-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299

DEBUG: Posting {'openqa': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'MicroOS-Image-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'qem': {'incidents': ['23269', '23211', '23284', '23272', '23265', '22742', '22557', '23224', '23006', '23299', '23246'], 'settings': {'REPOHASH': 'd208cd28cabadb8da79df71b9588bcc6', 'BUILD': '20220324-1', 'DISTRI': 'sle-micro', 'VERSION': '5.0', 'REGISTRY': '18.156.2.117:5000', 'FLAVOR': 'MicroOS-Image-Updates', 'ARCH': 'x86_64', '_OBSOLETE': 1, 'OS_TEST_ISSUES': '22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299'}, 'repohash': 'd208cd28cabadb8da79df71b9588bcc6', 'build': '20220324-1', 'arch': 'x86_64', 'product': 'SLEMICRO50IMAGE'}, 'api': 'api/update_settings'}

INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299

What puzzles me a little bit right now: the slem_migration_5.0_to_5.1 job is not mentioned in the bot-ng output, the VERSION of that test is 5.1, not 5.0, and this test is never queried in any sync job. That makes me wonder if those are really the bot-ng scheduled jobs.

#7 Updated by jlausuch 3 months ago

ph03nix wrote:

Here is one: https://openqa.suse.de/tests/8381554#dependencies

This is not applicable to this ticket, as the cancelled children are due to the parent failing in some step: Test died: command 'chown bernhard /dev/ttysclp0 && usermod -a -G tty,dialout,$(stat -c %G /dev/ttysclp0) bernhard' timed out at /usr/lib/os-autoinst/testapi.pm line 950.

#8 Updated by jlausuch 3 months ago

okurz wrote:

For the sake of completeness, it also says

Cloned as 8381520

https://openqa.suse.de/admin/auditlog?eventid=11929826 tells us that it was jlausuch manually creating the clone. But that's not a problem, that is just jlausuch fixing what is unexpected, that the job got cancelled.

Yes, I manually retriggered.

#9 Updated by jlausuch 3 months ago

jbaier_cz wrote:

What puzzles me a little bit right now: the slem_migration_5.0_to_5.1 job is not mentioned in the bot-ng output, the VERSION of that test is 5.1, not 5.0, and this test is never queried in any sync job. That makes me wonder if those are really the bot-ng scheduled jobs.

There is an explanation for that.
The version is forced to 5.1 because it's the TARGET version as the migration test expects:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22
and
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L69

This is the job yaml changing the version with +VERSION: https://gitlab.suse.de/qac/qac-openqa-yaml/-/blob/master/sle-micro/updates.yaml#L170

This way, the job gets scheduled on product version 5.0 with the right incidents for 5.0 and boots the 5.0 HDD.
If I move this test to the 5.1 product, the bot will schedule the incidents for 5.1 but I will be booting the 5.0 HDD, which is the wrong approach.
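The +VERSION mechanism described above can be sketched as a settings merge where a "+"-prefixed test-suite key wins over the product value. This is an illustrative model only, not openQA's actual implementation; the function name and merge details are assumptions.

```python
# Illustrative sketch: how a "+"-prefixed setting such as +VERSION can
# override the value coming from the scheduled product. Hypothetical
# helper, not openQA code.
def merge_settings(product_settings, suite_settings):
    """Product settings win by default; a '+'-prefixed suite key forces
    its value through, with the '+' stripped in the result."""
    result = dict(suite_settings)
    result.update(product_settings)          # product values win by default
    for key, value in suite_settings.items():
        if key.startswith("+"):
            result.pop(key, None)
            result[key[1:]] = value          # forced override of the product value
    return result

product = {"DISTRI": "sle-micro", "VERSION": "5.0", "FLAVOR": "MicroOS-Image-Updates"}
suite = {"+VERSION": "5.1"}
merged = merge_settings(product, suite)
print(merged["VERSION"])  # "5.1", even though the product was scheduled with 5.0
```

This matches the behavior described in the comment: the product is scheduled with VERSION=5.0 (so the right 5.0 incidents and HDD are used), but the resulting job carries VERSION=5.1.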

#10 Updated by cdywan 3 months ago

  • Subject changed from Some of the daily aggregate tests are cancelled without a reason to Some of the daily aggregate tests are cancelled without a reason size:M
  • Description updated (diff)
  • Status changed from New to Workable

#11 Updated by jbaier_cz 3 months ago

jlausuch wrote:

jbaier_cz wrote:

What puzzles me a little bit right now: the slem_migration_5.0_to_5.1 job is not mentioned in the bot-ng output, the VERSION of that test is 5.1, not 5.0, and this test is never queried in any sync job. That makes me wonder if those are really the bot-ng scheduled jobs.

There is an explanation for that.
The version is forced to 5.1 because it's the TARGET version as the migration test expects:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22
and
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L69

This is the job yaml changing the version with +VERSION: https://gitlab.suse.de/qac/qac-openqa-yaml/-/blob/master/sle-micro/updates.yaml#L170

This way, the job gets scheduled on product version 5.0 with the right incidents for 5.0 and boots the 5.0 HDD.
If I move this test to the 5.1 product, the bot will schedule the incidents for 5.1 but I will be booting the 5.0 HDD, which is the wrong approach.

Ok, so the job created via

INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299

is forced to VERSION=5.1, but that would mean that the job created a second later (a few lines down)

INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=b1df128921fdd5cfba2d5a44cea087a7 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.1 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=21371,22557,22623,22742,23006,23007,23008,23211,23216,23224,23246,23253,23264,23267,23277,23284,23291,23299

which has the same DISTRI+FLAVOR+VERSION+ARCH combination, will obsolete the first one, right?
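The suspected collision can be sketched as follows. This is a minimal illustration under the assumption that openQA identifies "the same product" by the DISTRI+FLAVOR+VERSION+ARCH tuple; `product_key` is a hypothetical helper, not openQA code.

```python
# Illustrative sketch of the collision: after the +VERSION override, the
# 5.0-scheduled job and the 5.1-scheduled job map to the same product key,
# so the later isos post (with _OBSOLETE=1) cancels the earlier job.
def product_key(settings):
    # Assumption: these four settings identify a product for obsoletion.
    return tuple(settings[k] for k in ("DISTRI", "VERSION", "FLAVOR", "ARCH"))

first_post = {"DISTRI": "sle-micro", "VERSION": "5.0",
              "FLAVOR": "MicroOS-Image-Updates", "ARCH": "x86_64"}
# The migration test suite forces VERSION to 5.1 via +VERSION:
first_post_effective = dict(first_post, VERSION="5.1")
second_post = {"DISTRI": "sle-micro", "VERSION": "5.1",
               "FLAVOR": "MicroOS-Image-Updates", "ARCH": "x86_64"}

print(product_key(first_post_effective) == product_key(second_post))  # True
```

Without the override the two posts would be distinct products and neither would obsolete the other.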

#12 Updated by jlausuch 3 months ago

jbaier_cz wrote:

Ok, so the job created via

INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=d208cd28cabadb8da79df71b9588bcc6 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.0 REGISTRY=18.156.2.117:5000 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=22557,22742,23006,23211,23224,23246,23265,23269,23272,23284,23299

is forced to VERSION=5.1, but that would mean that the job created a second later (a few lines down)

INFO: openqa-cli api --host https://openqa.suse.de -X post isos REPOHASH=b1df128921fdd5cfba2d5a44cea087a7 BUILD=20220324-1 DISTRI=sle-micro VERSION=5.1 FLAVOR=MicroOS-Image-Updates ARCH=x86_64 _OBSOLETE=1 OS_TEST_ISSUES=21371,22557,22623,22742,23006,23007,23008,23211,23216,23224,23246,23253,23264,23267,23277,23284,23291,23299

which has the same DISTRI+FLAVOR+VERSION+ARCH combination, will obsolete the first one, right?

I'm really not sure how that +VARIABLE works, whether it just replaces the variable at runtime or re-triggers the job with the updated variable...

If this repeats every day for SLE Micro, then this could be the reason, but I've seen this in the past with other types of tests which don't replace any variable. I will try to monitor jobs during the next days to see if I catch this issue again.

#13 Updated by okurz 3 months ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz
  • Priority changed from Urgent to High

Ok, so this really looks like a very specific problem and limited in impact. So I am taking the ticket and lowering the priority.

jlausuch I don't plan any changes right now on the tooling side; basically waiting for your further results and whether you can fix it from the test side.

#14 Updated by jlausuch 3 months ago

okurz wrote:

Ok, so this really looks like a very specific problem and limited in impact. So I am taking the ticket and lowering the priority.

jlausuch I don't plan any changes right now on the tooling side; basically waiting for your further results and whether you can fix it from the test side.

There is nothing to fix from the test side.
Ok, I'll post more links when I see this failure.

#15 Updated by okurz 3 months ago

  • Category changed from Concrete Bugs to Support
  • Status changed from Feedback to Resolved

jlausuch I understood from your last comment that no further work is necessary, so I resolve the ticket despite no changes being done in test code (as you mentioned).

Regarding what we stated as acceptance criteria in the ticket description

Acceptance criteria

  • AC1: It is clear what the expected scheduling behavior is (e.g. is VERSION supposed to be 5.1 in the job despite VERSION 5.0 being specified when scheduling the product)

It is now clear that we don't have an unexpected regression but a corner case: openQA media are defined by the VERSION, and that version can be overridden by the according test settings, so tests which might appear to be for separate products (including the version) turn out to be the same product from openQA's point of view. And as https://gitlab.suse.de/qa-maintenance/bot-ng is instructed to schedule new tests with the variable _OBSOLETE=1, all former jobs of the same product are obsoleted, i.e. cancelled. A possible workaround is to not apply the _OBSOLETE setting here; older tests would then just be deprioritized but kept scheduled or running. A similar approach was suggested for aggregate tests in general in https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73
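The obsolete-vs-deprioritize trade-off mentioned here can be sketched with a simplified model. This is illustrative only, not openQA's scheduler code; the function, the priority increment, and the job structure are all assumptions.

```python
# Illustrative sketch of the two scheduling modes discussed in this ticket
# for pending jobs of the same product when a new build is posted.
def on_new_build(old_jobs, obsolete=True, deprioritize=False):
    """Return (cancelled, kept) pending jobs of the same product."""
    if obsolete:
        # _OBSOLETE=1: all former jobs of the same product are cancelled.
        return list(old_jobs), []
    if deprioritize:
        # Workaround: keep old jobs scheduled/running, just lower their
        # priority (increment of 10 is an arbitrary illustration).
        kept = [dict(job, prio=job["prio"] + 10) for job in old_jobs]
        return [], kept
    return [], list(old_jobs)

pending = [{"id": 8379476, "prio": 50}]
cancelled, kept = on_new_build(pending, obsolete=True)
print(len(cancelled))  # 1: the old aggregate job is cancelled, as observed here
```

With `obsolete=False, deprioritize=True` the same job would survive with a lower priority instead of disappearing without an obvious reason on its job page.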

#16 Updated by jlausuch 3 months ago

okurz wrote:

jlausuch I understood from your last comment that no further work is necessary, so I resolve the ticket despite no changes being done in test code (as you mentioned).

Regarding what we stated as acceptance criteria in the ticket description

Acceptance criteria

  • AC1: It is clear what the expected scheduling behavior is (e.g. is VERSION supposed to be 5.1 in the job despite VERSION 5.0 being specified when scheduling the product)

It is now clear that we don't have an unexpected regression but a corner case: openQA media are defined by the VERSION, and that version can be overridden by the according test settings, so tests which might appear to be for separate products (including the version) turn out to be the same product from openQA's point of view. And as https://gitlab.suse.de/qa-maintenance/bot-ng is instructed to schedule new tests with the variable _OBSOLETE=1, all former jobs of the same product are obsoleted, i.e. cancelled. A possible workaround is to not apply the _OBSOLETE setting here; older tests would then just be deprioritized but kept scheduled or running. A similar approach was suggested for aggregate tests in general in https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/73

Yes, I guess that's the issue. As I'm overriding VERSION, the bot obsoletes those jobs. It happens every day, btw.
How can I stop applying the _OBSOLETE setting in https://gitlab.suse.de/qa-maintenance/metadata/-/blob/master/bot-ng/micro50img.yml ?

#17 Updated by okurz 3 months ago

jlausuch wrote:

Yes, I guess that's the issue. As I'm overriding VERSION, the bot obsoletes those jobs. It happens every day, btw.
How can I stop applying the _OBSOLETE setting in https://gitlab.suse.de/qa-maintenance/metadata/-/blob/master/bot-ng/micro50img.yml ?

The best way would be to not change the VERSION variable to overlap with a different product. Why do you actually need to change that version instead of setting TARGET_VERSION, which is read in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22 ?

#18 Updated by jlausuch 3 months ago

okurz wrote:

The best way would be to not change the VERSION variable to overlap with a different product. Why do you actually need to change that version instead of setting TARGET_VERSION, which is read in https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/migration/online_migration/zypper_migration.pm#L22 ?

In fact, I implemented this variable today to avoid that problem :)
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/14641

#19 Updated by okurz 3 months ago

Awesome! So I actually looked at your solution already :D
