action #101882: [qe-core] aarch64 workers: test fails in patch_and_reboot - openQA Tests (public) - openSUSE Project Management Tool

Added by mgrifalconi over 3 years ago. Updated 12 months ago.

Status: New
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version:
Start date:
Due date:
% Done:
Estimated time:
Difficulty:

Description

Observation

The test expects to match the bootloader needle ( https://openqa.suse.de/tests/7588098#step/patch_and_reboot/52 ), but openQA does not match it and the test fails.
I suppose we could check for the login needle at the same time and skip the bootloader handling if the bootloader needle is not found.
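
The idea of checking for the bootloader and the login needle at the same time can be sketched roughly as follows. This is a plain Python simulation and not os-autoinst code: `wait_for_any`, `handle_reboot` and the tag names `bootloader` and `login` are hypothetical names chosen for illustration. In real test code the analogous pattern would probably be an `assert_screen` call with a list of tags followed by `match_has_tag`.

```python
from typing import Iterable, Optional, Sequence, Set


def wait_for_any(screens: Iterable[Set[str]], wanted: Sequence[str]) -> Optional[str]:
    """Return the first wanted tag seen on any observed screen, else None.

    `screens` stands in for successive screenshots, each reduced to the set
    of needle tags that would match it (an assumption of this sketch).
    """
    for visible in screens:
        for tag in wanted:  # earlier entries in `wanted` take priority
            if tag in visible:
                return tag
    return None


def handle_reboot(screens: Iterable[Set[str]]) -> str:
    """Decide how to continue after the reboot, tolerating a missed boot menu."""
    matched = wait_for_any(screens, ["bootloader", "login"])
    if matched == "bootloader":
        return "handle boot menu, then wait for login"
    if matched == "login":
        return "boot menu was missed, continue from login"
    return "fail: neither screen appeared"
```

With this approach a job that reboots faster than the needle check no longer fails only because the boot menu was never seen; it fails only if neither screen shows up.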

openQA test in scenario sle-15-Server-DVD-Updates-aarch64-qam-gnome@aarch64-virtio fails in patch_and_reboot

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20211103-1

Expected result

Last good: 20211102-1 (or more recent)

Further details

Always latest result in this scenario: latest

Updated by dzedro over 3 years ago

This is aarch64; similar issues also happen in other tests. I don't think this has anything to do with the test itself but with the aarch64 worker:
the job just gets stuck, whether because of high load or I/O I don't know exactly.
Another case is zypper failing with a timeout because the process simply hung.

Updated by tjyrinki_suse over 3 years ago

Subject changed from [qe-core] test fails in patch_and_reboot to aarch64 workers: test fails in patch_and_reboot
Start date deleted (~~2021-11-03~~)

Adjusting according to previous comment. Maybe the load on the aarch64 workers should be limited further?

Updated by okurz over 3 years ago

Subject changed from aarch64 workers: test fails in patch_and_reboot to [qe-core] aarch64 workers: test fails in patch_and_reboot

tjyrinki_suse wrote:

Adjusting according to previous comment.

so who should pick it?

Maybe the load on the aarch64 workers should be limited further?

Well, the load would be more limited if we ran even fewer openQA worker instances on the hosts. All of that is configured in https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls ; however, we have already reduced a lot from what the machines should theoretically be capable of. And the current machines openqaworker-arm-[123] don't really suffer from high load, see e.g. https://monitor.qa.suse.de/d/WDopenqaworker-arm-3/worker-dashboard-openqaworker-arm-3?viewPanel=65092&orgId=1&refresh=1m&from=now-90d&to=now and https://monitor.qa.suse.de/d/WDopenqaworker-arm-2/worker-dashboard-openqaworker-arm-2?viewPanel=65092&orgId=1&refresh=1m&from=now-30d&to=now and https://monitor.qa.suse.de/d/WDopenqaworker-arm-1/worker-dashboard-openqaworker-arm-1?viewPanel=65092&orgId=1&refresh=1m&from=now-30d&to=now , where even the CPU usage spikes are well below 40%. There are two new machines, openqaworker-arm-4 and openqaworker-arm-5, which cause their own kind of problems and were removed from production again 2 months ago, see #101048 . You might benefit from comparing the stability of tests with the aarch64 machine we have in the o3 network. That one seems to be much more stable; I don't know why. ggardet might also be of help; you can reach out to ggardet e.g. in #opensuse-factory

Updated by tjyrinki_suse about 3 years ago

Subject changed from [qe-core] aarch64 workers: test fails in patch_and_reboot to [infrastructure] aarch64 workers: test fails in patch_and_reboot
Category changed from Bugs in existing tests to Infrastructure
Target version set to Ready

I meant the tools team, as this seems like an infrastructure issue; maybe these changes convey that better.

Currently I do not have ideas how the tests could be made more stable, as they are working on other architectures.

Updated by tjyrinki_suse about 3 years ago

Priority changed from High to Normal

I think it is rare enough, though, not to be high priority.

Updated by okurz about 3 years ago

Subject changed from [infrastructure] aarch64 workers: test fails in patch_and_reboot to [qe-core] aarch64 workers: test fails in patch_and_reboot
Category changed from Infrastructure to Bugs in existing tests
Target version deleted (~~Ready~~)

We looked at the problem during the SUSE QE Tools workshop session on 2021-12-10. We checked the referenced scenario from the ticket description, https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle&flavor=Server-DVD-Updates&machine=aarch64-virtio&test=qam-gnome&version=15 , and in the 38 available test results we found two occurrences of "patch_and_reboot" failing. https://openqa.suse.de/tests/7798294#step/patch_and_reboot/62 shows a zypper conflict, so it is certainly not related to "infrastructure" or specific to aarch64 worker performance. The other case is https://openqa.suse.de/tests/7596208#step/patch_and_reboot/55 , which shows that the test expected to see the grub bootloader but we find the gdm login window instead. https://openqa.suse.de/tests/7596208/file/patch_and_reboot-dmesg.log shows that an actual reboot has happened in between. https://openqa.suse.de/tests/7596208/logfile?filename=autoinst-log.txt shows

Suggestions

Fix "WARNING: check_asserted_screen took 0.83 seconds for 14 candidate needles - make your needles more specific" from autoinst-log.txt to have a higher chance of matching within a reasonable time -> Delete unused, obsolete, duplicate needles
Videos can help for debugging -> Consider splitting the scenarios or explicitly
Fix the bootloader check race condition. This can be solved in multiple ways: a) use the same approach as in installation jobs and disable the grub timeout so that the boot menu stays until explicitly handled, or b) don't expect the bootloader menu if there is no need to stop there
Update the maintainer of patch_and_reboot. It still has coolo as maintainer.
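
To see how often the needle warning quoted in the first suggestion shows up, a log file could be scanned with a small script along these lines. This is only a sketch: the regular expression is derived from the single warning line quoted above, and the function name `slow_needle_checks` and the `min_candidates` threshold are hypothetical.

```python
import re
from typing import List, Tuple

# Pattern modelled on the warning text quoted in this ticket.
WARNING_RE = re.compile(
    r"check_asserted_screen took (?P<seconds>\d+(?:\.\d+)?) seconds "
    r"for (?P<count>\d+) candidate needles"
)


def slow_needle_checks(log_text: str, min_candidates: int = 10) -> List[Tuple[float, int]]:
    """Return (seconds, candidate_count) for each matching warning line.

    Only warnings with at least `min_candidates` candidate needles are kept;
    the default of 10 is an arbitrary value picked for this sketch.
    """
    hits = []
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match and int(match.group("count")) >= min_candidates:
            hits.append((float(match.group("seconds")), int(match.group("count"))))
    return hits
```

Running this over the autoinst-log.txt of several jobs would give a rough picture of which steps compare against the most candidate needles and are therefore the first places to look for unused or duplicate needles.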

@qe-core I guess with these suggestions this should go back to you

Updated by slo-gin about 2 years ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.

Updated by slo-gin 12 months ago

This ticket was set to Normal priority but was not updated within the SLO period. Please consider picking up this ticket or just set the ticket to the next lower priority.
