action #80264

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

Multi-machine tests unable to get vars from their pair job

Added by Julie_CAO 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Concrete Bugs
Target version:
Start date: 2020-11-24
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

Virtualization pair tests began failing today (build #88.1, beta1 candidate) because a job is unable to get variables from its pair job. For instance:

https://openqa.nue.suse.com/tests/5057381/file/autoinst-log.txt
Use of uninitialized value in numeric eq (==) at /usr/lib/os-autoinst/mmapi.pm line 212.
Use of uninitialized value in concatenation (.) or string at /usr/lib/os-autoinst/mmapi.pm line 216.
[2020-11-24T03:29:26.211 CET] [debug] get_job_autoinst_vars: code:
Use of uninitialized value in numeric eq (==) at /usr/lib/os-autoinst/mmapi.pm line 212.

The tests run fine in my local openQA and they also worked in build 78.1 (13 days ago), thus it is likely the recent openQA updates broke them.

Steps to reproduce

Find the corresponding tests as marked on osd. Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, i.e. call openqa-query-for-job-label poo#80264:

$ env host=osd openqa-query-for-job-label 80264
5057381|2020-11-24 03:44:15|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-xen-dst||linux-1nn1
5057379|2020-11-24 03:25:27|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-kvm-dst||linux-1nn1
5057377|2020-11-24 01:35:09|done|failed|virt-guest-migration-sles12sp5-from-developing-to-developing-xen-dst||linux-1nn1
5057375|2020-11-23 23:24:29|done|failed|virt-guest-migration-sles12sp5-from-developing-to-developing-kvm-dst||linux-1nn1
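
The query output above is pipe-separated, so it can be grouped with a few lines of Python. This is only a minimal sketch: it assumes (not confirmed by the script itself) that the first field is the job ID and the last field is the worker host, and the function name group_jobs_by_host is made up for illustration.

```python
from collections import defaultdict

# Sample copied verbatim from the query output in this ticket.
SAMPLE = """\
5057381|2020-11-24 03:44:15|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-xen-dst||linux-1nn1
5057379|2020-11-24 03:25:27|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-kvm-dst||linux-1nn1
5057377|2020-11-24 01:35:09|done|failed|virt-guest-migration-sles12sp5-from-developing-to-developing-xen-dst||linux-1nn1
5057375|2020-11-23 23:24:29|done|failed|virt-guest-migration-sles12sp5-from-developing-to-developing-kvm-dst||linux-1nn1
"""


def group_jobs_by_host(text):
    """Map the last column (assumed to be the worker host) to job IDs."""
    jobs = defaultdict(list)
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        jobs[fields[-1]].append(int(fields[0]))
    return dict(jobs)


if __name__ == "__main__":
    for host, job_ids in group_jobs_by_host(SAMPLE).items():
        print(host, job_ids)
```

If the assumption about the last column holds, all four labelled jobs ran on the same host.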

Problem

Possibly caused by updates of dependencies or by the last deployment of osd on 2020-11-13, see http://mailman.suse.de/mlarch/SuSE/openqa/2020/openqa.2020.11/msg00004.html, or by something similar to #76783.


Related issues

Copied to openQA Project - action #80412: tests fail with auto_review:"(?s)version is 4\.6\.1606298538\.191b5988.*Can.*t locate object method.*code.*via package":retry (Resolved, 2020-11-24)

History

#1 Updated by xlai 7 months ago

okurz, would you please help to assign the ticket? It blocks all of our guest migration tests for beta1, which have P0-P2 test priority (key virtualization tests).

#2 Updated by xlai 7 months ago

  • Project changed from openQA Infrastructure to openQA Project

#3 Updated by okurz 7 months ago

  • Description updated (diff)
  • Category set to Concrete Bugs
  • Status changed from New to Workable
  • Target version set to Ready

Sure, we can try to help. https://openqa.nue.suse.com/tests/5057381/#next_previous shows that only a single job failed in this scenario. I currently find four jobs with this ticket used as a label. Are there more?

#4 Updated by okurz 7 months ago

  • Priority changed from High to Urgent

#5 Updated by mkittler 7 months ago

  • Assignee set to mkittler

#6 Updated by okurz 7 months ago

  • Status changed from Workable to In Progress

We have not really applied any related changes for openQA or os-autoinst. It is just that the network shows some problems and EngInfra could not help us yet. I am trying to fix inconsistent hostnames in #76786 and applied a setting temporarily. I retriggered one test job now as https://openqa.nue.suse.com/tests/5065034 and will monitor the result.

#7 Updated by mkittler 7 months ago

thus It is likely the recent openQA updates broke them.

The error indicates that openQA didn't respond at all. So openQA was likely not reachable at all and is therefore likely not the culprit (besides being unreachable). The code in os-autoinst hasn't changed in ages. The OPENQA_URL variable is set correctly in the test. Maybe network/setup issues cause this job to fail?

Here's a PR for improving the error handling in mmapi.pm (as yet untested): https://github.com/os-autoinst/os-autoinst/pull/1571

#8 Updated by Julie_CAO 7 months ago

Thanks for your quick action.

These pair jobs still failed in the new build #89.1.

The entire list of failing jobs:
virt-guest-migration-developing-from-developing-to-developing-kvm-dst/src
in build 88.1 https://openqa.nue.suse.com/tests/5057359 & https://openqa.nue.suse.com/tests/5057360
in build 89.1 https://openqa.nue.suse.com/tests/5066482 & https://openqa.nue.suse.com/tests/5066483

virt-guest-migration-developing-from-developing-to-developing-xen-dst/src
in build 88.1 https://openqa.nue.suse.com/tests/5057361 & https://openqa.nue.suse.com/tests/5057362
in build 89.1 https://openqa.nue.suse.com/tests/5066484 & https://openqa.nue.suse.com/tests/5066485

In addition, tens of jobs (virt-guest-migration-xxx) in the virtualization-milestone group are affected:
https://openqa.nue.suse.com/tests/overview?distri=sle&version=15-SP3&build=88.1&groupid=264

#9 Updated by mkittler 7 months ago

My PR seems to be deployed now. So I've restarted one of the clusters: https://openqa.nue.suse.com/tests/5070807

This will not fix anything, but hopefully we'll get a better error message. The URL being queried should be logged as well now.

#10 Updated by okurz 7 months ago

  • Estimated time set to 39719.00 h

#11 Updated by Julie_CAO 7 months ago

Thanks for improving the error message.

"We have not really applied any related changes for openQA or os-autoinst. It is just that the network shows some problems and EngInfra could not help us yet. I am trying to fix inconsistent hostnames in #76786 and applied a setting temporarily."

Does that mean that an Infra ticket will not help, and the fix of #76786 will make the pair jobs run ultimately?

#12 Updated by okurz 7 months ago

  • Estimated time deleted (39719.00 h)

#13 Updated by okurz 7 months ago

  • Parent task set to #39719

#14 Updated by okurz 7 months ago

Julie_CAO wrote:

Does that mean that an Infra ticket will not help, and the fix of #76786 will make the pair jobs run ultimately?

We are still not sure what the error could be or what could have recently caused this.

The following new incompletes have appeared:

5066484|2020-11-24 23:11:56|done|failed|virt-guest-migration-developing-from-developing-to-developing-xen-dst||linux-1nn1
5066482|2020-11-24 21:20:02|done|failed|virt-guest-migration-developing-from-developing-to-developing-kvm-dst||linux-1nn1
5065034|2020-11-24 18:22:28|done|failed|virt-guest-migration-sles12sp5-from-sles12sp5-to-developing-xen-dst||linux-1nn1

which are still on the wrong worker hostname. mkittler plans to roll out https://github.com/os-autoinst/os-autoinst/pull/1571 on the affected machine so that, if the problem reappears, the error message would change and should help us.

#15 Updated by Julie_CAO 7 months ago

okurz wrote:

Julie_CAO wrote:

Does that mean that an Infra ticket will not help, and the fix of #76786 will make the pair jobs run ultimately?

We are still not sure what the error could be or what could have recently caused this.

which are still on the wrong worker hostname. mkittler plans to roll out https://github.com/os-autoinst/os-autoinst/pull/1571 on the affected machine so that, if the problem reappears, the error message would change and should help us.

Got it. Thanks.

#16 Updated by mkittler 7 months ago

I also had a look at the code of the openSUSE test distribution and would like to note that get_children, get_job_autoinst_vars and other mmapi functions might return undef when it is not possible to get the variables. So the test code needs to check for undef and die if the lack of these variables is critical. My PR didn't change this. It only replaces Perl warnings with better error messages.

If you do not check for undef in places where you might get undef, you'll see Perl warnings like those in https://openqa.nue.suse.com/tests/5066511/file/autoinst-log.txt:

Use of uninitialized value $rough_result in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 53.
Use of uninitialized value $installation_failed_guests in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 76.
$VAR1 = undef;
[2020-11-25T15:21:20.687 CET] [info] ::: basetest::runtest: # Test died: Can't use an undefined value as a HASH reference at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/virt_autotest_base.pm line 35.

Use of uninitialized value $rough_result in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 53.
Use of uninitialized value $installation_failed_guests in split at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/guest_migration_src.pm line 76.
$VAR1 = undef;
[2020-11-25T15:21:20.687 CET] [debug] post_fail_hook failed: Can't use an undefined value as a HASH reference at /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/virt_autotest/virt_autotest_base.pm line 35.
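
The check described above can be sketched as follows. This is a Python illustration of the pattern only, not the actual test code (which is Perl): None plays the role of undef, and require_pair_vars and unreachable_pair are hypothetical names standing in for a wrapper around an mmapi call such as get_job_autoinst_vars.

```python
from typing import Callable, Optional


def require_pair_vars(fetch: Callable[[int], Optional[dict]], job_id: int) -> dict:
    """Return the pair job's variables or fail early with a clear message.

    `fetch` is a hypothetical stand-in for an mmapi function that may
    return nothing (undef in Perl, None here) when the lookup fails.
    """
    pair_vars = fetch(job_id)
    if pair_vars is None:
        # Failing here avoids the follow-up "uninitialized value" noise.
        raise RuntimeError(f"Unable to get variables of pair job {job_id}")
    return pair_vars


def unreachable_pair(job_id: int) -> Optional[dict]:
    """Hypothetical fetcher that simulates an unreachable pair job."""
    return None


if __name__ == "__main__":
    try:
        require_pair_vars(unreachable_pair, 5057381)
    except RuntimeError as err:
        print(err)
```

The point of the pattern is that the test stops at the first missing value with a message naming the pair job, instead of continuing and failing later in unrelated code.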

The job I've restarted has just started to run so let's see what happens.

#17 Updated by mkittler 7 months ago

Yesterday I initially thought the hostname problem might be related to this mmapi problem. Then I discarded the conclusion because the mmapi queries go from the worker to the web UI. However, that's actually not entirely true. The problematic function get_job_autoinst_vars makes a query from one worker to another worker based on a hostname previously queried from the web UI. So the wrong hostname might have caused this after all. If this is true, https://openqa.nue.suse.com/tests/5070808#live should pass now because the hostname is fixed now.
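
The two-step lookup described above can be sketched roughly like this. It is a simplified Python model under stated assumptions, not the mmapi.pm implementation: the two query functions are injected placeholders, and the "worker_host" key as well as the example hostnames are invented for illustration.

```python
from typing import Callable, Optional


def get_pair_vars(
    job_id: int,
    query_webui: Callable[[int], Optional[dict]],
    query_worker: Callable[[str, int], Optional[dict]],
) -> Optional[dict]:
    """Model of the lookup: web UI first, then the pair job's worker."""
    job = query_webui(job_id)  # step 1: worker asks the web UI about the job
    if job is None:
        return None
    host = job.get("worker_host")  # hypothetical key holding the hostname
    if not host:
        return None
    return query_worker(host, job_id)  # step 2: worker asks the other worker


# Hypothetical setup: the pair job's worker answers only under its real name.
REACHABLE = {"worker-new.example": {"EXAMPLE_VAR": "1"}}


def worker_lookup(host: str, job_id: int) -> Optional[dict]:
    return REACHABLE.get(host)


if __name__ == "__main__":
    stale = lambda job_id: {"worker_host": "worker-old.example"}
    fixed = lambda job_id: {"worker_host": "worker-new.example"}
    print(get_pair_vars(1, stale, worker_lookup))  # stale hostname: no vars
    print(get_pair_vars(1, fixed, worker_lookup))  # correct hostname: vars
```

In this model the web UI can be perfectly reachable and the lookup still fails, because the hostname it hands out does not lead to the other worker.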

#18 Updated by Julie_CAO 7 months ago

Hi, yes, the hostname fix seems to resolve the failure to fetch vars. The pair job continued to run in your case, https://openqa.nue.suse.com/tests/5070808. It failed due to an existing bug. Besides, I checked the parallel jobs in our job group: they have recovered from this issue and are running well.

Thanks.

#19 Updated by mkittler 7 months ago

Yes, the job now fails and prints Perl warnings like I mentioned in my previous comment. But I've checked the test distribution code and it doesn't look like these variables are initialized from mmapi functions. So likely this bug is unrelated.

#20 Updated by okurz 7 months ago

  • Copied to action #80412: tests fail with auto_review:"(?s)version is 4\.6\.1606298538\.191b5988.*Can.*t locate object method.*code.*via package":retry added

#21 Updated by okurz 7 months ago

The latest merged PR caused a regression as seen on o3, handled in #80412.

#22 Updated by okurz 7 months ago

  • Due date set to 2020-12-03
  • Priority changed from Urgent to Normal

With the above confirmation we can reduce the priority, but we still plan to follow up with the improvements.

#24 Updated by mkittler 7 months ago

  • Status changed from In Progress to Feedback

#25 Updated by okurz 7 months ago

  • Due date deleted (2020-12-03)
  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/os-autoinst/pull/1578 merged. Given that the original problem is already solved and that it would likely take a very long time until we see the better error handling in action, we can regard this ticket as done. However, it could of course happen that unexpected regressions are introduced which we would only see in production, e.g. from o3 tomorrow, in multi-machine test scenarios. We will hear about it then :)
