action #28648
[CaaSP] fails to match the BIOS needle
Status: closed
Added by pgeorgiadis about 7 years ago. Updated over 6 years ago.
% Done: 100
Description
Observation
openQA test in scenario caasp-2.0-CaaSP-DVD-Incidents-x86_64-QAM-CaaSP-autoyast1@qam-caasp_x86_64 fails in installation
Reproducible
Fails since (at least) Build :6053:velum.1512010207 (current job)
Expected result
Last good: :4392:shim.1512002931 (or more recent)
Further details
Always latest result in this scenario: latest
Looking for the BIOS screen at boot for just a few seconds cannot be reliable. Especially when there are many needles to check, the time frame for matching the needle is quite small; as a result, the test failed this time.
Also, from the discussion in #testing, there is a proposal to remove the unnecessary needles.
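To illustrate the timing problem, here is a minimal sketch using the os-autoinst testapi; this is not the actual test code, and the needle tags 'bios-splash' and 'grub-menu' are hypothetical names:

    use strict;
    use warnings;
    use testapi;

    sub run {
        # Fragile: the BIOS splash is only visible for a few seconds, so on an
        # overloaded worker a short match window can easily miss it:
        #   assert_screen('bios-splash', 3);

        # More tolerant: treat the splash as optional and only assert on a
        # screen that stays up, such as the boot loader menu.
        if (check_screen('bios-splash', 5)) {
            record_info('BIOS', 'BIOS splash screen was seen');
        }
        assert_screen('grub-menu', 90);
    }

    1;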
Updated by pgeorgiadis about 7 years ago
This keeps happening again and again when the openQA worker is overloaded. Especially on my local instance, which is not that powerful, I am hitting this in almost every run. One of the nodes will fail to acknowledge that the installation is finished and it will fail the whole job :/
What about adding 'https://openqa.suse.de/tests/1280957#step/first_boot/1' as the last screenshot of the installation test?
Or another proposal: what about configuring how long the BIOS splash screen is shown in qemu?
Updated by cyberiad about 7 years ago
I don't fully understand how crucial this test is for CaaSP in general, but I think it shouldn't block our current tests. Let's try to sort it out when the experts for this are back.
Updated by pgeorgiadis almost 7 years ago
@cyberiad I've just pushed a PR that fixes our problem with autoyast and too many insufficiently specific needles.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4105
Updated by pgeorgiadis almost 7 years ago
I've changed my mind and completely removed the autoyast test from the qam-caasp scenario. We now check the boot screen directly after the autoyast installation. In the worst case this will fail after 20 minutes (in case there's an error in autoyast).
New PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4114
Updated by okurz almost 7 years ago
In the non-autoyast tests we disable the grub timeout so we have all the time in the world to check the boot menu against needles. We use the test module "tests/installation/disable_grub_timeout.pm" for that. I guess the same idea should be applicable to autoyast tests as well: disable the grub timeout before booting into the installed system. For autoyast tests in particular this means adjusting the autoyast profiles accordingly. I don't know the details by heart, but I am convinced that configuring this in autoyast profiles is possible.
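As an illustration of that idea, a rough sketch of a test module (not the content of disable_grub_timeout.pm; the needle tag 'grub-menu' is an assumption): once the menu is matched, pressing a cursor key stops GRUB's countdown, so the remaining boot checks are no longer under time pressure.

    use strict;
    use warnings;
    use testapi;

    sub run {
        # Generous timeout so overloaded workers still catch the menu.
        assert_screen('grub-menu', 120);
        # Pressing a cursor key halts GRUB's countdown.
        send_key('up');
        send_key('down');    # move back to the default entry
        # Boot the default entry once we are ready to continue.
        send_key('ret');
    }

    1;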
Updated by pgeorgiadis almost 7 years ago
Pff, the PR still does not solve the problem. Now it doesn't fail on the BIOS screen, but on the GRUB screen.
Oliver, thanks for the idea, but unfortunately I don't think we should modify the autoyast profile. It gets generated after the installation of the admin node, and the idea behind it is that all the other nodes can use this profile afterwards in order to become 'clients/members' of this admin node. Modifying it is not something the customer should do, so I would prefer to avoid touching it.
Is there any way to tell openQA: Do not do anything for 20 minutes. Just wait 20 minutes idling?
Updated by okurz almost 7 years ago
pgeorgiadis wrote:
Is there any way to tell openQA: Do not do anything for 20 minutes. Just wait 20 minutes idling?
I don't see a use case for that. You always wait for something. So … what are you waiting for?
Updated by pgeorgiadis almost 7 years ago
okurz wrote:
pgeorgiadis wrote:
Is there any way to tell openQA: Do not do anything for 20 minutes. Just wait 20 minutes idling?
I don't see a use case for that. You always wait for something. So … what are you waiting for?
Very good question. I am waiting for the root login screen, which is a static thing. I've updated the PR on GitHub and I have tested it ~20 times.
http://skyrim.qam.suse.de/group_overview/98?limit_builds=50 (see the Build:6053:velum.[2-20])
If you go there you are going to see failures. These are caused by a real bug (a race condition). I am going to file a bug report as soon as this PR gets merged.
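As a sketch of what "waiting for the root login screen" from above can look like in a test module (the needle tag 'linux-login' and the exact timeout are assumptions, not necessarily what the PR does):

    use strict;
    use warnings;
    use testapi;

    sub run {
        # The autoyast installation plus reboot can take a long time on an
        # overloaded worker, so allow up to 20 minutes for the login prompt.
        assert_screen('linux-login', 1200);
    }

    1;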
Regarding the question of why the QAM tests differ from the QA tests, the answer is the following:
1. Hardware: We do not have the same hardware resources, so our tests end up failing because of overloaded workers. Also, apart from production, in development we do not have the luxury of the same resources, so a simple workstation cannot handle an 11-node CaaSP test.
2. Focus: QAM focuses on the maintenance incident and its effect on the system. We take for granted that the GM installation/boot works.
As a result, this PR is all about optimizing (2) to work around (1). Of course it would be better to have more hardware, but that is not going to happen soon enough, while in the meantime there are bugs which do not show up in openQA because of (1).
Updated by mkravec almost 7 years ago
This issue is present in multiple places; for example, transactional-update reboots fail here quite often:
https://openqa.suse.de/tests/1353571#step/transactional_update/27
I think that long-term we should find a more systematic solution. We can try:
- debug (ask the tools team) where the "Stall detected" message is coming from: https://openqa.suse.de/tests/1353793#step/rebootmgr/33
- not share workers (openqaworker8/9) with 64bit
- limit the number of workers per physical machine (currently 24)
About low-spec hardware:
- book something from orthos; it's hell to do it on a local machine. I am using mars (https://orthos.arch.suse.de/index.php?host=3109) - the spec is enough for 6 workers
- we have dedicated HW (openqaworker8/9) for CaaSP in openQA; we should make it reliable - a task for the new year :)
Updated by mkravec almost 7 years ago
I updated the openQA QAM configuration:
Added to the admin node:
QEMUCPUS=4
QEMURAM=8192
Default for all nodes:
QEMUCPUS=1
QEMURAM=4096
If that does not help, we should lower the worker count per machine.
Please observe and ping me if something comes up.
Updated by pgeorgiadis almost 7 years ago
As discussed with Martin K, we decreased the number of nodes (from 12 to 8) in order to let the system breathe a little better. Now, in my dev environment everything seems to be working at the moment, while in production I still see failures (see https://openqa.suse.de/tests/1363784#) even with a 5-node cluster (which is already smaller than 8). As a result, we need to do something about it. @mkravec, what ideas do you have?
Updated by mkravec almost 7 years ago
Our debugging showed that patches applied during the admin node installation broke velum (it was extremely slow).
Not accepting updates during installation worked fine; the QAM team will file a bug.
Updated by mkravec almost 7 years ago
- Status changed from Resolved to In Progress
- Assignee set to mkravec
Spotted again in https://openqa.suse.de/tests/1554587
Updated by mkravec almost 7 years ago
- Status changed from In Progress to Feedback
- % Done changed from 0 to 100
Updated by mkravec over 6 years ago
- Status changed from Feedback to Resolved
We have not had this issue for some time now, I think it's fixed :)