action #101882

[qe-core] aarch64 workers: test fails in patch_and_reboot

Added by mgrifalconi 3 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Bugs in existing tests
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

The test expects to find the bootloader needle ( https://openqa.suse.de/tests/7588098#step/patch_and_reboot/52 ), but for some reason openQA does not match it and the test fails.
I suppose we could check for the login needle at the same time and skip the bootloader handling if the bootloader is not found.
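
The idea of checking for both screens at once can be sketched as follows. This is only a language-neutral illustration in Python, not the actual test code: the real patch_and_reboot module is written against the os-autoinst testapi, and the tag names "bootloader" and "login" as well as the helpers wait_for_any, handle_reboot, screen_has and select_boot_entry are hypothetical names made up for this sketch.

```python
import time


def wait_for_any(tags, screen_has, timeout=30.0, interval=0.5):
    """Poll until one of `tags` is visible; return that tag, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        for tag in tags:
            if screen_has(tag):
                return tag
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


def handle_reboot(screen_has, select_boot_entry, timeout=30.0, interval=0.5):
    """Accept either the boot menu or the login screen after a reboot.

    `screen_has(tag)` is a hypothetical callback that reports whether the
    screen currently matches `tag`; `select_boot_entry()` is a hypothetical
    callback that confirms the default boot entry.
    """
    matched = wait_for_any(["bootloader", "login"], screen_has, timeout, interval)
    if matched is None:
        raise TimeoutError("neither bootloader nor login screen appeared")
    if matched == "bootloader":
        # The boot menu was caught in time: handle it, then wait for login.
        select_boot_entry()
        if wait_for_any(["login"], screen_has, timeout, interval) is None:
            raise TimeoutError("login screen did not appear after the boot menu")
    # If the login screen was matched first, the boot menu was simply missed
    # and there is nothing left to handle.
    return matched
```

With this shape, a job that misses the boot menu no longer fails as long as the system reaches the login screen, while a job that sees neither screen still fails with a clear timeout.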

openQA test in scenario sle-15-Server-DVD-Updates-aarch64-qam-gnome@aarch64-virtio fails in patch_and_reboot

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20211103-1

Expected result

Last good: 20211102-1 (or more recent)

Further details

Always latest result in this scenario: latest

History

#1 Updated by dzedro 3 months ago

This is aarch64. Similar issues also happen in other tests, and I don't think this has anything to do with the test itself but rather with the aarch64 worker: the job just gets stuck, maybe because of high load or I/O, I don't know what exactly.
Another case is zypper failing with a timeout because the process just hung.

#2 Updated by tjyrinki_suse about 2 months ago

  • Subject changed from [qe-core] test fails in patch_and_reboot to aarch64 workers: test fails in patch_and_reboot
  • Start date deleted (2021-11-03)

Adjusting according to previous comment. Maybe loads at aarch64 workers should be more limited to be lower?

#3 Updated by okurz about 2 months ago

  • Subject changed from aarch64 workers: test fails in patch_and_reboot to [qe-core] aarch64 workers: test fails in patch_and_reboot

tjyrinki_suse wrote:

Adjusting according to previous comment.

so who should pick it?

Maybe loads at aarch64 workers should be more limited to be lower?

Well, the load would be more limited if we ran even fewer openQA worker instances on the hosts. All of that is configured in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls , however we have already reduced the number a lot from what the machines should theoretically be capable of. The current machines openqaworker-arm-[123] also don't really suffer from high load, see e.g. https://monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?viewPanel=65092&orgId=1&refresh=1m&from=now-90d&to=now and https://monitor.qa.suse.de/d/WDopenqaworker-arm-2/worker-dashboard-openqaworker-arm-2?viewPanel=65092&orgId=1&refresh=1m&from=now-30d&to=now and https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?viewPanel=65092&orgId=1&refresh=1m&from=now-30d&to=now where even the CPU usage spikes are well below 40%.

There are two new machines, openqaworker-arm-4 and openqaworker-arm-5, which cause their own kind of problems and were removed from production again 2 months ago, see #101048 .

You might benefit from comparing the stability of tests with the aarch64 machine we have in the o3 network. That one seems to be much more stable, I don't know why. ggardet might also be of help; you can reach out to ggardet e.g. in #opensuse-factory

#4 Updated by tjyrinki_suse about 2 months ago

  • Subject changed from [qe-core] aarch64 workers: test fails in patch_and_reboot to [infrastructure] aarch64 workers: test fails in patch_and_reboot
  • Category changed from Bugs in existing tests to Infrastructure
  • Target version set to Ready

I meant the tools team, as this seems like an infrastructure issue; maybe these changes convey that better.

Currently I do not have ideas how the tests could be made more stable, as they are working for other architectures.

#5 Updated by tjyrinki_suse about 2 months ago

  • Priority changed from High to Normal

I think it is rare enough, though, not to be high priority.

#6 Updated by okurz about 2 months ago

  • Subject changed from [infrastructure] aarch64 workers: test fails in patch_and_reboot to [qe-core] aarch64 workers: test fails in patch_and_reboot
  • Category changed from Infrastructure to Bugs in existing tests
  • Target version deleted (Ready)

We looked at the problem during the SUSE QE Tools workshop session 2021-12-10. We checked the referenced scenario from the ticket description https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle&flavor=Server-DVD-Updates&machine=aarch64-virtio&test=qam-gnome&version=15 and in the 38 available test results we found two occurrences of "patch_and_reboot" failing. https://openqa.suse.de/tests/7798294#step/patch_and_reboot/62 shows a zypper conflict, so it is certainly not related to "infrastructure" or specific to aarch64 worker performance. The other case is https://openqa.suse.de/tests/7596208#step/patch_and_reboot/55 which shows that the test expected to see the grub bootloader but found the gdm login window instead. https://openqa.suse.de/tests/7596208/file/patch_and_reboot-dmesg.log shows that an actual reboot happened in between. https://openqa.suse.de/tests/7596208/logfile?filename=autoinst-log.txt shows

Suggestions

  1. Fix "WARNING: check_asserted_screen took 0.83 seconds for 14 candidate needles - make your needles more specific" from autoinst-log.txt to have a higher chance of matching within a reasonable time -> Delete unused, obsolete, duplicate needles
  2. Videos can help for debugging -> Consider splitting the scenarios or explicitly
  3. Fix the bootloader check race condition. This can be solved in multiple ways: a) use the same approach as in installation jobs and disable the grub timeout so that the boot menu stays until explicitly handled, or b) don't expect the bootloader menu if there is no need to stop there
  4. Update the maintainer of patch_and_reboot. It still has coolo as maintainer.
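
As a starting point for the needle cleanup in suggestion 1, the number of needles per tag can be counted to see which tags have many candidates. The following is a minimal sketch, assuming the needles are JSON files with a "tags" list stored flat in one directory; the function names needles_per_tag and crowded_tags and the threshold are hypothetical and not part of any openQA tooling.

```python
import json
from collections import Counter
from pathlib import Path


def needles_per_tag(needles_dir):
    """Count how many needle JSON files in `needles_dir` carry each tag."""
    counts = Counter()
    for path in sorted(Path(needles_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        if not isinstance(data, dict):
            continue  # not a needle definition
        for tag in data.get("tags", []):
            counts[tag] += 1
    return counts


def crowded_tags(needles_dir, threshold=10):
    """Return sorted (tag, count) pairs for tags with at least `threshold` needles."""
    counts = needles_per_tag(needles_dir)
    return sorted((tag, n) for tag, n in counts.items() if n >= threshold)
```

Tags reported by crowded_tags are only candidates for review; whether a needle is really unused, obsolete or a duplicate still has to be decided by someone who knows the tests.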

@qe-core I guess with these suggestions this should go back to you
