action #80264
closed coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
multimachine tests unable to get vars from its pair job
Description
Observation
Virtualization pair tests began failing today (build #88.1, beta1 candidate) because the job is unable to get variables from its pair job. For instance:
https://openqa.nue.suse.com/tests/5057381/file/autoinst-log.txt
Use of uninitialized value in numeric eq (==) at /usr/lib/os-autoinst/mmapi.pm line 212.
Use of uninitialized value in concatenation (.) or string at /usr/lib/os-autoinst/mmapi.pm line 216.
[2020-11-24T03:29:26.211 CET] [debug] get_job_autoinst_vars: code:
Use of uninitialized value in numeric eq (==) at /usr/lib/os-autoinst/mmapi.pm line 212.
The tests run fine in my local openQA and they also worked in build 78.1 (13 days ago), so it is likely that the recent openQA updates broke them.
Steps to reproduce
Find corresponding tests as marked on osd. Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label , call openqa-query-for-job-label poo#80264
$ env host=osd openqa-query-for-job-label 80264
5057381|2020-11-24 03:44:15|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-xen-dst||linux-1nn1
5057379|2020-11-24 03:25:27|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-kvm-dst||linux-1nn1
5057377|2020-11-24 01:35:09|done|failed|virt-guest-migration-sles12sp5-from-developing-to-developing-xen-dst||linux-1nn1
5057375|2020-11-23 23:24:29|done|failed|virt-guest-migration-sles12sp5-from-developing-to-developing-kvm-dst||linux-1nn1
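The pipe-separated output above can be post-processed, for example to see which worker host the failing jobs ran on. The following is only a minimal Python sketch: the field names (`job_id`, `timestamp`, `state`, `result`, `test`, `worker_host`) are assumptions read off the sample lines, not a documented format of openqa-query-for-job-label, and `parse_job_line` is a hypothetical helper.

```python
from typing import NamedTuple


class LabeledJob(NamedTuple):
    # Field names are assumptions inferred from the sample output.
    job_id: int
    timestamp: str
    state: str
    result: str
    test: str
    worker_host: str


def parse_job_line(line: str) -> LabeledJob:
    """Split one pipe-separated line; the sixth field is empty in the sample."""
    job_id, timestamp, state, result, test, _unused, worker_host = line.strip().split("|")
    return LabeledJob(int(job_id), timestamp, state, result, test, worker_host)


SAMPLE = """\
5057381|2020-11-24 03:44:15|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-xen-dst||linux-1nn1
5057379|2020-11-24 03:25:27|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-kvm-dst||linux-1nn1
"""

jobs = [parse_job_line(line) for line in SAMPLE.splitlines() if line]
# Collect the distinct values of the last column across all failed jobs.
worker_hosts = {job.worker_host for job in jobs}
```

Grouping by the last column makes it easy to spot whether all affected jobs share the same worker host value.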
Problem
Possibly caused by dependency updates or by the last deployment of osd on 2020-11-13, see http://mailman.suse.de/mlarch/SuSE/openqa/2020/openqa.2020.11/msg00004.html, or something similar to #76783
Updated by xlai almost 4 years ago
@okurz Would you please help to assign the ticket? It blocks all of our guest migration tests for beta1, which have P0-P2 test priority (key virtualization tests).
Updated by xlai almost 4 years ago
- Project changed from openQA Infrastructure to openQA Project
Updated by okurz almost 4 years ago
- Description updated (diff)
- Category set to Regressions/Crashes
- Status changed from New to Workable
- Target version set to Ready
Sure we can try to help. https://openqa.nue.suse.com/tests/5057381/#next_previous shows only a single job failed in this scenario. I found currently four jobs with this ticket used as label. Are there more?
Updated by okurz almost 4 years ago
- Status changed from Workable to In Progress
We have not really applied any related changes for openQA or os-autoinst. It is just that the network shows some problems and EngInfra could not help us yet. I am trying to fix inconsistent hostnames in #76786 and applied a setting temporarily. I retriggered one test job now as https://openqa.nue.suse.com/tests/5065034 and will monitor the result.
Updated by mkittler almost 4 years ago
thus It is likely the recent openQA updates broke them.
The error indicates that openQA didn't respond at all. So openQA was likely not reachable and is therefore likely not the culprit (besides being unreachable). The code in os-autoinst hasn't changed in ages. The OPENQA_URL variable is set correctly in the test. Maybe network/setup issues cause this job to fail?
Here's a PR for improving the error handling in mmapi.pm (yet untested): https://github.com/os-autoinst/os-autoinst/pull/1571
Updated by Julie_CAO almost 4 years ago
Thanks for your quick action.
These pair jobs still failed in the new build #89.1.
The entire list of failing jobs:
virt-guest-migration-developing-from-developing-to-developing-kvm-dst/src
in build 88.1 https://openqa.nue.suse.com/tests/5057359 & https://openqa.nue.suse.com/tests/5057360
in build 89.1 https://openqa.nue.suse.com/tests/5066482 & https://openqa.nue.suse.com/tests/5066483
virt-guest-migration-developing-from-developing-to-developing-xen-dst/src
in build 88.1 https://openqa.nue.suse.com/tests/5057361 & https://openqa.nue.suse.com/tests/5057362
in build 89.1 https://openqa.nue.suse.com/tests/5066484 & https://openqa.nue.suse.com/tests/5066485
In addition, tens of jobs (virt-guest-migration-xxx) fail in the virtualization-milestone group:
https://openqa.nue.suse.com/tests/overview?distri=sle&version=15-SP3&build=88.1&groupid=264
Updated by mkittler almost 4 years ago
My PR seems to be deployed now. So I've restarted one of the clusters: https://openqa.nue.suse.com/tests/5070807
This will not fix anything, but hopefully we'll get a better error message. The URL being queried should be logged as well now.
Updated by Julie_CAO almost 4 years ago
Thanks for improving the error message.
"We have not really applied any related changes for openQA or os-autoinst. It is just that the network shows some problems and EngInfra could not help us yet. I am trying to fix inconsistent hostnames in #76786 and applied a setting temporarily. "
Does that mean that an Infra ticket will not help, and the fix of #76786 will make the pair jobs run ultimately?
Updated by okurz almost 4 years ago
Julie_CAO wrote:
Does that mean that an Infra ticket will not help, and the fix of #76786 will make the pair jobs run ultimately?
We are still not sure what the error could be or what could have recently caused this.
the following new incompletes have appeared:
5066484|2020-11-24 23:11:56|done|failed|virt-guest-migration-developing-from-developing-to-developing-xen-dst||linux-1nn1
5066482|2020-11-24 21:20:02|done|failed|virt-guest-migration-developing-from-developing-to-developing-kvm-dst||linux-1nn1
5065034|2020-11-24 18:22:28|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-xen-dst||linux-1nn1
which are still on the wrong worker hostname. mkittler plans to roll out https://github.com/os-autoinst/os-autoinst/pull/1571 on the affected machine so that, if the problem reappears, the error message would change and should help us.
Updated by Julie_CAO almost 4 years ago
okurz wrote:
Julie_CAO wrote:
Does that mean that an Infra ticket will not help, and the fix of #76786 will make the pair jobs run ultimately?
We are still not sure what the error could be or what could have recently caused this.
which are still on the wrong worker hostname. mkittler plans to roll out https://github.com/os-autoinst/os-autoinst/pull/1571 on the affected machine so that, if the problem reappears, the error message would change and should help us.
Got it. Thanks.
Updated by mkittler almost 4 years ago
I also had a look at the code of the openSUSE test distribution and would like to note that get_children, get_job_autoinst_vars and other mmapi functions might return undef when it is not possible to get the variables. So the test code needs to check for undef and, if the lack of these variables is critical, die. My PR didn't change this. It only replaces Perl warnings with better error messages.
If you do not check for undef in places where you might get undef, you'll see Perl warnings like in https://openqa.nue.suse.com/tests/5066511/file/autoinst-log.txt
Use of uninitialized value $rough_result in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 53.
Use of uninitialized value $installation_failed_guests in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 76.
$VAR1 = undef;
[2020-11-25T15:21:20.687 CET] [info] ::: basetest::runtest: # Test died: Can't use an undefined value as a HASH reference at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/virt_autotest_base.pm line 35.
Use of uninitialized value $rough_result in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 53.
Use of uninitialized value $installation_failed_guests in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 76.
$VAR1 = undef;
[2020-11-25T15:21:20.687 CET] [debug] post_fail_hook failed: Can't use an undefined value as a HASH reference at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/virt_autotest_base.pm line 35.
The job I've restarted has just started to run so let's see what happens.
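The check-and-die pattern described in this comment can be sketched as follows. This is a language-neutral illustration written in Python, not the actual Perl test code: `fetch_vars` is a hypothetical stand-in for a call such as get_job_autoinst_vars, `None` plays the role of Perl's undef, and raising an exception corresponds to die.

```python
from typing import Callable, Optional


def require_pair_vars(fetch_vars: Callable[[int], Optional[dict]], job_id: int) -> dict:
    """Return the pair job's variables or fail early with a clear message."""
    job_vars = fetch_vars(job_id)
    if job_vars is None:
        # Fail here instead of letting later code trip over a missing value.
        raise RuntimeError(f"Unable to get variables of pair job {job_id}")
    return job_vars


def unreachable_pair(job_id: int) -> Optional[dict]:
    # Hypothetical fetcher simulating a pair job whose variables cannot be read.
    return None


def reachable_pair(job_id: int) -> Optional[dict]:
    # Hypothetical fetcher simulating a successful lookup.
    return {"EXAMPLE_VAR": "value"}
```

Failing at the point of the lookup gives one clear error instead of a chain of "uninitialized value" warnings followed by an unrelated-looking crash further down.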
Updated by mkittler almost 4 years ago
Yesterday I initially thought the hostname problem might be related to this mmapi problem. Then I discarded that conclusion because the mmapi queries go from the worker to the web UI. However, that's actually not entirely true. The problematic function get_job_autoinst_vars makes a query from one worker to another worker based on a hostname previously queried from the web UI. So the wrong hostname might have caused this after all. If this is true, https://openqa.nue.suse.com/tests/5070808#live should pass now because the hostname is fixed now.
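The two-step lookup described above can be modelled roughly like this. It is only a Python sketch of the control flow, not the mmapi.pm implementation: `lookup_worker_host`, `query_worker`, the host names and the variable name are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical registry: variables are only available under the worker's real hostname.
REACHABLE_WORKERS = {"worker.example": {42: {"EXAMPLE_VAR": "value"}}}


def query_worker(host: str, job_id: int) -> Optional[dict]:
    """Step 2: ask the worker running the pair job for that job's variables."""
    jobs_on_host = REACHABLE_WORKERS.get(host)
    return None if jobs_on_host is None else jobs_on_host.get(job_id)


def get_pair_vars(
    job_id: int,
    lookup_worker_host: Callable[[int], Optional[str]],
    query: Callable[[str, int], Optional[dict]],
) -> Optional[dict]:
    """Step 1: ask the web UI for the worker's hostname, then query that worker."""
    host = lookup_worker_host(job_id)
    if host is None:
        return None
    return query(host, job_id)


# A stale hostname reported by the web UI makes the second step fail,
# while the corrected hostname lets the same query succeed.
with_stale_name = get_pair_vars(42, lambda job_id: "stale-name", query_worker)
with_fixed_name = get_pair_vars(42, lambda job_id: "worker.example", query_worker)
```

In this model the web UI itself answers correctly in both cases; only the hostname it hands out decides whether the worker-to-worker query can succeed.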
Updated by Julie_CAO almost 4 years ago
Hi, yes, fixing the hostname seems to resolve the failure to fetch vars. The pair job continued to run in your case, https://openqa.nue.suse.com/tests/5070808. It failed due to an existing bug. I also checked the parallel jobs in our job group; they have recovered from this issue and are running well.
Thanks.
Updated by mkittler almost 4 years ago
Yes, the job now fails and prints Perl warnings like I mentioned in my previous comment. But I've checked the test distribution code and it doesn't look like these variables are initialized from mmapi functions. So likely this bug is unrelated.
Updated by okurz almost 4 years ago
- Copied to action #80412: tests fail with auto_review:"(?s)version is 4\.6\.1606298538\.191b5988.*Can.*t locate object method.*code.*via package":retry added
Updated by okurz almost 4 years ago
the latest merged PR caused a regression as seen on o3, handled in #80412
Updated by okurz almost 4 years ago
- Due date set to 2020-12-03
- Priority changed from Urgent to Normal
With the above confirmation we can reduce the priority, but we still plan to follow up with the improvements.
Updated by mkittler almost 4 years ago
- Status changed from In Progress to Feedback
Updated by okurz almost 4 years ago
- Due date deleted (2020-12-03)
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/os-autoinst/pull/1578 merged. Given that the original problem is already solved and likely it would take a very long time until we see the better error handling in action we can regard this ticket as done. However, of course it could happen that unexpected regressions are introduced which we would only see in production, e.g. from o3 tomorrow, in multi-machine test scenarios. We will hear about it then :)
Updated by szarate about 1 year ago
- Copied to action #136154: multimachine tests restarted by RETRY test variable end up without the proper dependency size:M added