action #28648

[CaaSP] fails to match the BIOS needle

Added by pgeorgiadis about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2017-11-30
Due date:
% Done:

100%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario caasp-2.0-CaaSP-DVD-Incidents-x86_64-QAM-CaaSP-autoyast1@qam-caasp_x86_64 fails in installation

Reproducible

Fails since (at least) Build :6053:velum.1512010207 (current job)

Expected result

Last good: :4392:shim.1512002931 (or more recent)

Further details

Always latest result in this scenario: latest

Looking for the BIOS screen at boot for just a few seconds cannot be reliable. Especially when there are many needles to check, the time frame for matching the needle is quite small; as a result, this time the test failed.
Also, from the discussion in #testing, there is a proposal to remove the unnecessary needles.
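For illustration only, a sketch in os-autoinst's Perl test API (the needle names here are assumptions, not the real ones): `check_screen` with a short timeout can easily miss a transient splash screen, while `assert_screen` with a generous timeout keeps waiting until the needle matches or the timeout expires.

```perl
# Sketch, not the actual test module. With a short window the
# transient BIOS splash is easy to miss on a loaded worker;
# check_screen returns undef on timeout instead of failing:
if (check_screen('bios-splash', 3)) {
    record_info('BIOS', 'splash matched within the short window');
}

# A generous timeout is far more robust for screens that stay up;
# assert_screen fails the module only after the full 90 seconds:
assert_screen('grub-menu', 90);
```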

History

#1 Updated by pgeorgiadis about 5 years ago

This keeps happening again and again when the openQA worker is overloaded. Especially on my local instance, which is not that powerful, I am hitting this in almost every run. One of the nodes will fail to acknowledge that the installation is finished and it will fail the whole job :/

What about adding 'https://openqa.suse.de/tests/1280957#step/first_boot/1' as the last screen of the installation test?
Or another proposal: what about configuring how long the BIOS splash screen is shown in qemu?

#2 Updated by cyberiad about 5 years ago

I don't fully understand how crucial this test is for CaaSP in general, but I think it shouldn't block our current tests. Let's try to sort it out when the experts for this are back.

#3 Updated by pgeorgiadis about 5 years ago

cyberiad: I've just pushed a PR that fixes our problem with autoyast and too many non-specific needles.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4105

#4 Updated by pgeorgiadis about 5 years ago

I've changed my mind and completely removed the autoyast test from the qam-caasp scenario. We now check the boot screen directly after the autoyast installation. In the worst case this will fail after 20 minutes (if there is an error in autoyast).

New PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4114

#5 Updated by okurz about 5 years ago

In the non-autoyast tests we disable the grub timeout, so we have all the time in the world to check the boot menu against needles. We use the test module "tests/installation/disable_grub_timeout.pm" for that. I guess the same idea should be applicable to autoyast tests as well: disable the grub timeout before booting into the installed system. For autoyast tests in particular this means adjusting the autoyast profiles accordingly. I don't know by heart, but I am convinced that configuring this in autoyast profiles is possible.
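If the profile could be modified, a hypothetical AutoYaST fragment along these lines (untested; element names per the AutoYaST bootloader schema) would disable the menu timeout:

```xml
<!-- Hypothetical fragment: a GRUB timeout of -1 makes the boot
     menu wait indefinitely, so needle matching is not time-critical -->
<bootloader>
  <global>
    <timeout config:type="integer">-1</timeout>
  </global>
</bootloader>
```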

#6 Updated by pgeorgiadis about 5 years ago

Pff, the PR still does not solve the problem. Now it doesn't fail on the BIOS screen, but on the GRUB screen.

Oliver, thanks for the idea, but unfortunately I don't think we should modify the autoyast profile. It gets generated after the installation of the admin node, and the idea behind it is that all the other nodes can then use this profile to become 'clients/members' of this admin node. Modifying it is not something the customer would do, so I would prefer to avoid touching it.

Is there any way to tell openQA: do not do anything for 20 minutes, just wait 20 minutes idling?

#7 Updated by okurz about 5 years ago

pgeorgiadis wrote:

Is there any way to tell openQA: do not do anything for 20 minutes, just wait 20 minutes idling?

I don't see a use case for that. You always wait for something. So … what are you waiting for?

#8 Updated by pgeorgiadis about 5 years ago

okurz wrote:

pgeorgiadis wrote:

Is there any way to tell openQA: do not do anything for 20 minutes, just wait 20 minutes idling?

I don't see a use case for that. You always wait for something. So … what are you waiting for?

Very good question. I am waiting for the root login screen, which is a static thing. I've updated the PR in GitHub and I have tested it ~20 times.
http://skyrim.qam.suse.de/group_overview/98?limit_builds=50 (see the Build:6053:velum.[2-20])
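Rather than idling unconditionally, the usual os-autoinst pattern for this is a single `assert_screen` with a generous timeout; a minimal sketch (the needle name here is an assumption):

```perl
# Wait up to 20 minutes for the static root login prompt; this
# returns as soon as the needle matches, so no time is wasted
# when the boot finishes earlier.
assert_screen('linux-login', 1200);
```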

Going there you are going to see failures. This is because of a real bug (a race condition). I am going to file a bug as soon as this PR gets merged.

On the question of why QAM differs from the QA tests, the answer is the following:

  1. Hardware: We do not have the same hardware resources, so our tests end up failing because of overloaded workers. Also, apart from production, in development we do not have the luxury of the same resources; a simple workstation cannot handle an 11-node CaaSP test.

  2. Focus: QAM focuses on the maintenance incident and its effect on the system. We take for granted that the GM installation/boot works.

As a result, optimizing (2) to resolve (1) is what this PR is all about. Of course it would be better to have more hardware, but that is not going to happen soon enough, and in the meantime there are bugs which do not show up in openQA because of (1).

#9 Updated by mkravec about 5 years ago

This issue is present in multiple places; for example, transactional-update reboots fail here quite often:
https://openqa.suse.de/tests/1353571#step/transactional_update/27

I think that long-term we should find a more systematic solution. We can try:

About low-spec hardware:

  • book something from orthos; it's hell to do it on a local machine. I am using mars (https://orthos.arch.suse.de/index.php?host=3109) - the spec is enough for 6 workers
  • we have dedicated HW (openqaworker8/9) for CaaSP in openQA; we should make it reliable - a task for the new year :)

#10 Updated by mkravec about 5 years ago

I updated openQA QAM configuration:

Added to admin node:
QEMUCPUS=4
QEMURAM=8192

Default for all nodes:
QEMUCPUS=1
QEMURAM=4096

If that does not help, we should lower the worker count per machine.
Please observe and ping me if something comes up.

#11 Updated by pgeorgiadis about 5 years ago

As discussed with Martin K, we decreased the number of nodes (from 12 to 8) in order to let the system breathe a little better. So now in my dev environment everything seems to be working atm, while in production I have failures (see https://openqa.suse.de/tests/1363784#) even with a 5-node cluster (which is already smaller than 8). As a result, we need to do something about it. mkravec, what ideas do you have?

#12 Updated by mkravec about 5 years ago

Our debugging showed that patches applied during the admin node installation broke Velum (it was extremely slow).

Not accepting updates during the installation worked fine; the QAM team will file a bug.

#13 Updated by pgeorgiadis about 5 years ago

  • Status changed from New to Resolved

#14 Updated by mkravec almost 5 years ago

  • Status changed from Resolved to In Progress
  • Assignee set to mkravec

#15 Updated by mkravec almost 5 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100

#16 Updated by mkravec almost 5 years ago

  • Status changed from Feedback to Resolved

We have not had this issue for some time now, so I think it's fixed :)
