action #28648

[CaaSP] fails to match the BIOS needle

Added by pgeorgiadis about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2017-11-30
Due date:
% Done:

100%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario caasp-2.0-CaaSP-DVD-Incidents-x86_64-QAM-CaaSP-autoyast1@qam-caasp_x86_64 fails in installation

Reproducible

Fails since (at least) Build :6053:velum.1512010207 (current job)

Expected result

Last good: :4392:shim.1512002931 (or more recent)

Further details

Always latest result in this scenario: latest

Looking for the BIOS screen at boot for just a few seconds cannot be reliable. Especially when there are many needles to check, the time frame for matching the needle is quite small; as a result, this time the test failed.
Also, from the discussion in #testing, there is a proposal to remove the unnecessary needles.
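For illustration only, a sketch in os-autoinst's Perl test API (the needle names here are assumptions, not the real ones): `check_screen` with a short timeout can easily miss a transient splash screen, while `assert_screen` with a generous timeout keeps waiting until the needle matches or the timeout expires.

```perl
# Sketch, not the actual test module. With a short window the
# transient BIOS splash is easy to miss on a loaded worker;
# check_screen returns undef on timeout instead of failing:
if (check_screen('bios-splash', 3)) {
    record_info('BIOS', 'splash matched within the short window');
}

# A generous timeout is far more robust for screens that stay up;
# assert_screen fails the module only after the full 90 seconds:
assert_screen('grub-menu', 90);
```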

History

#1 Updated by pgeorgiadis about 5 years ago

This keeps happening again and again when the openQA worker is overloaded. Especially on my local instance, which is not that powerful, I am hitting this in almost every run. One of the nodes will fail to acknowledge that the installation is finished and it will fail the whole job :/

What about adding 'https://openqa.suse.de/tests/1280957#step/first_boot/1' as the last screen of the installation test?
Or another proposal: what about configuring how long the BIOS splash screen is shown in qemu?

#2 Updated by cyberiad about 5 years ago

I don't fully understand how crucial this test is for CaaSP in general, but I think it shouldn't block our current tests. Let's try to sort it out when the experts for this are back.

#3 Updated by pgeorgiadis about 5 years ago

cyberiad: I've just pushed a PR that fixes our problem with autoyast and too many non-specific needles.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4105

#4 Updated by pgeorgiadis about 5 years ago

I've changed my mind and completely removed the autoyast test from the qam-caasp scenario. We now check the boot screen directly after the autoyast installation. In the worst case this will fail after 20 minutes (if there is an error in autoyast).

New PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4114

#5 Updated by okurz about 5 years ago

In the non-autoyast tests we disable the grub timeout, so we have all the time in the world to check the boot menu against needles. We use the test module "tests/installation/disable_grub_timeout.pm" for that. I guess the same idea should be applicable to autoyast tests as well: disable the grub timeout before booting into the installed system. For autoyast tests in particular this means adjusting the autoyast profiles accordingly. I don't know by heart, but I am convinced that configuring this in autoyast profiles is possible.
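If the profile could be modified, a hypothetical AutoYaST fragment along these lines (untested; element names per the AutoYaST bootloader schema) would disable the menu timeout:

```xml
<!-- Hypothetical fragment: a GRUB timeout of -1 makes the boot
     menu wait indefinitely, so needle matching is not time-critical -->
<bootloader>
  <global>
    <timeout config:type="integer">-1</timeout>
  </global>
</bootloader>
```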

#6 Updated by pgeorgiadis about 5 years ago

Pff, the PR still does not solve the problem. Now it doesn't fail on the BIOS screen, but on the GRUB screen.

Oliver, thanks for the idea, but unfortunately I don't think we should modify the autoyast profile. It gets generated after the installation of the admin node, and the idea behind it is that all the other nodes can then use this profile to become 'clients/members' of this admin node. Modifying it is not something the customer would do, so I would prefer to avoid touching it.

Is there any way to tell openQA: do not do anything for 20 minutes, just wait 20 minutes idling?

#7 Updated by okurz about 5 years ago

pgeorgiadis wrote:

Is there any way to tell openQA: do not do anything for 20 minutes, just wait 20 minutes idling?

I don't see a use case for that. You always wait for something. So … what are you waiting for?

#8 Updated by pgeorgiadis about 5 years ago

okurz wrote:

pgeorgiadis wrote:

Is there any way to tell openQA: do not do anything for 20 minutes, just wait 20 minutes idling?

I don't see a use case for that. You always wait for something. So … what are you waiting for?

Very good question. I am waiting for the root login screen, which is a static thing. I've updated the PR in GitHub and I have tested it ~20 times.
http://skyrim.qam.suse.de/group_overview/98?limit_builds=50 (see the Build:6053:velum.[2-20])
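Rather than idling unconditionally, the usual os-autoinst pattern for this is a single `assert_screen` with a generous timeout; a minimal sketch (the needle name here is an assumption):

```perl
# Wait up to 20 minutes for the static root login prompt; this
# returns as soon as the needle matches, so no time is wasted
# when the boot finishes earlier.
assert_screen('linux-login', 1200);
```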

Going there you are going to see failures. This is because of a real bug (a race condition). I am going to file a bug as soon as this PR gets merged.

On the question of why QAM differs from the QA tests, the answer is the following:

  1. Hardware: We do not have the same hardware resources, so our tests end up failing because of overloaded workers. Also, apart from production, in development we do not have the luxury of the same resources; a simple workstation cannot handle an 11-node CaaSP test.

  2. Focus: QAM focuses on the maintenance incident and its effect on the system. We take for granted that the GM installation/boot works.

As a result, optimizing (2) to resolve (1) is what this PR is all about. Of course it would be better to have more hardware, but that is not going to happen soon enough, and in the meantime there are bugs which do not show up in openQA because of (1).

#9 Updated by mkravec about 5 years ago

This issue is present in multiple places; for example, transactional-update reboots fail here quite often:
https://openqa.suse.de/tests/1353571#step/transactional_update/27

I think that long-term we should find a more systematic solution. We can try:

About low-spec hardware:

  • book something from orthos; it's hell to do it on a local machine. I am using mars (https://orthos.arch.suse.de/index.php?host=3109) - the spec is enough for 6 workers
  • we have dedicated HW (openqaworker8/9) for CaaSP in openQA; we should make it reliable - a task for the new year :)

#10 Updated by mkravec about 5 years ago

I updated openQA QAM configuration:

Added to admin node:
QEMUCPUS=4
QEMURAM=8192

Default for all nodes:
QEMUCPUS=1
QEMURAM=4096

If that does not help, we should lower the worker count per machine.
Please observe and ping me if something comes up.

#11 Updated by pgeorgiadis about 5 years ago

As discussed with Martin K, we decreased the number of nodes (from 12 to 8) in order to let the system breathe a little better. So now in my dev environment everything seems to be working atm, while in production I have failures (see https://openqa.suse.de/tests/1363784#) even with a 5-node cluster (which is already smaller than 8). As a result, we need to do something about it. mkravec, what ideas do you have?

#12 Updated by mkravec about 5 years ago

Our debugging showed that patches applied during the admin node installation broke Velum (it was extremely slow).

Not accepting updates during the installation worked fine; the QAM team will file a bug.

#13 Updated by pgeorgiadis about 5 years ago

  • Status changed from New to Resolved

#14 Updated by mkravec almost 5 years ago

  • Status changed from Resolved to In Progress
  • Assignee set to mkravec

#15 Updated by mkravec almost 5 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100

#16 Updated by mkravec almost 5 years ago

  • Status changed from Feedback to Resolved

We have not had this issue for some time now, so I think it's fixed :)
