action #31444

closed

[kernel][tools][aarch64][ppc64le][s390x] Tests sometimes fail to boot

Added by metan about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 2018-02-07
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

This seems to be the same problem as action #27570. It has been reproducible in basically every kernel run for at least the last month; just look at the tests that fail in boot_ltp while looking for the grub2 needle.

[Santiago knows about this; I'm opening this issue so that we can track the progress here.]

Update: the same happens on other non-Intel architectures (possibly for a different reason on each architecture).

Actions #1

Updated by szarate about 6 years ago

  • Project changed from openQA Tests to openQA Project
  • Category set to 132
  • Assignee deleted (dasantiago)

There's a PR created by Thomas a while ago, which needs some love, also https://github.com/os-autoinst/os-autoinst/pull/754

With a quick look, I saw that you have two problems:

One is the system just not booting (or taking too long?). Could the purple cursor be needled and, if it is found, the wait timeout doubled (and marked as a workaround)? https://openqa.suse.de/tests/1449548

On the other hand, https://openqa.suse.de/tests/1449482 has https://progress.opensuse.org/issues/30613 which is also WIP by oli

Actions #2

Updated by okurz about 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ltp_numa
https://openqa.suse.de/tests/1498314

Actions #3

Updated by okurz about 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ltp_numa
https://openqa.suse.de/tests/1534754

Actions #4

Updated by coolo about 6 years ago

  • Project changed from openQA Project to openQA Tests
  • Category deleted (132)
  • Assignee set to metan

The test never boots. To avoid timeouts, disable the GRUB timeout during install. The ThunderX machines are just too slow to match 15 needles in 6 seconds, and test writers are just incapable of making the grub needles more specific than that :(
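
For illustration only, here is a minimal Python sketch of what "disable the GRUB timeout" could look like as a text edit. It assumes the timeout is controlled by a `GRUB_TIMEOUT=` line in a config such as `/etc/default/grub`; the ticket does not say how or where the setting was actually changed, and `set_grub_timeout` is a hypothetical helper name. A value of `-1` makes GRUB wait at the menu indefinitely instead of counting down.

```python
import re


def set_grub_timeout(config_text, value=-1):
    """Return config_text with GRUB_TIMEOUT set to value.

    Hypothetical helper: -1 makes GRUB wait at the menu indefinitely.
    """
    new_line = f"GRUB_TIMEOUT={value}"
    pattern = re.compile(r"^GRUB_TIMEOUT=.*$", re.MULTILINE)
    if pattern.search(config_text):
        # Replace every existing GRUB_TIMEOUT line.
        return pattern.sub(new_line, config_text)
    # No existing setting: append one, adding a newline separator if needed.
    sep = "" if config_text.endswith("\n") or not config_text else "\n"
    return config_text + sep + new_line + "\n"


sample = "GRUB_DEFAULT=0\nGRUB_TIMEOUT=8\n"
print(set_grub_timeout(sample))
```

On a real system the GRUB configuration would typically have to be regenerated afterwards (for example with `grub2-mkconfig`) before the change takes effect.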

Actions #5

Updated by metan about 6 years ago

  • Assignee changed from metan to rpalethorpe

@rpalethorpe I suppose that's the wait_boot() call in boot_ltp.pm.

Actions #6

Updated by rpalethorpe about 6 years ago

  • Status changed from New to In Progress

There is still some issue with script_output which I wrote about in Oliver's ticket, but it looks quite rare and there is probably some simple fix for it.

As for the other boot failures, they look a bit more serious. When it boots successfully, the purple cursor only shows for a few frames in the video, but when it fails it is obviously stuck for much longer. Also, if you look in serial0.txt, there is a message from either the kernel, Grub or the firmware saying:

Synchronous Exception at 0x00000000785111C4

It looks like it is the firmware saying this and it is possibly caused by Grub or the firmware.
https://developer.arm.com/products/architecture/a-profile/docs/100933/latest/synchronous-and-asynchronous-exceptions
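
A rough sketch of how one might scan a saved serial log for that message, in Python. The regular expression is based only on the single line quoted above; the rest of the log format and the helper name `find_sync_exceptions` are assumptions.

```python
import re

# Matches the message quoted above; any other log formatting is an assumption.
SYNC_EXC = re.compile(r"Synchronous Exception at (0x[0-9A-Fa-f]+)")


def find_sync_exceptions(serial_text):
    """Return the fault address (as an int) of every matching message."""
    return [int(m.group(1), 16) for m in SYNC_EXC.finditer(serial_text)]


# Example input; in practice this would be the contents of serial0.txt.
sample = "booting...\nSynchronous Exception at 0x00000000785111C4\n"
for address in find_sync_exceptions(sample):
    print(hex(address))
```

Running something like this over the serial0.txt of many jobs would show whether the failing boots always report the same address.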

Actions #8

Updated by rpalethorpe about 6 years ago

  • Status changed from In Progress to Blocked

Actions #9

Updated by okurz about 6 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ltp_controllers
https://openqa.suse.de/tests/1570898

Actions #10

Updated by rpalethorpe about 6 years ago

The developer thinks that the problem is more likely to be in the OVMF firmware than in Grub. As far as I can see, we take the firmware from the host. We may need to install a modified version on the workers to find the problem.

Actions #11

Updated by rpalethorpe about 6 years ago

Can someone please check that the workers are up to date and the correct firmware is being used? We should be getting more information than what is in the serial file: https://bugzilla.opensuse.org/show_bug.cgi?id=1085061#c11

Actions #12

Updated by rpalethorpe about 6 years ago

  • Assignee changed from rpalethorpe to szarate

I am not sure whose responsibility it is to check that the workers are up to date, so I will just assign to Santi :-).

Actions #13

Updated by szarate about 6 years ago

@Richie I added a comment on the ticket: workers are running version 2.9.1-6.9.2, which is a bit old (just after Spectre).

We might update everything after RC3, but my guess would be that it is not related. Anyhow, let's see; I'll keep an eye on the BSC.

Actions #14

Updated by szarate about 6 years ago

  • Assignee changed from szarate to rpalethorpe

@Richie, perhaps you can try on a Cavium, starting with the AAVMF firmware that comes with the SLE12SP3 package? Check my comment.

Actions #15

Updated by pvorel almost 6 years ago

  • Subject changed from aarch64 sometimes fail to boot to [kernel][tools][aarch64][ppc64le][s390x] Tests sometimes fail to boot
  • Description updated (diff)

Actions #16

Updated by rpalethorpe almost 6 years ago

  • Status changed from Blocked to Workable

Why the name change? We have identified that this is ARM specific. So you should probably create some other tickets.

@szarate, maybe I can do it at the same time as testing the QEMU refactor. Or you could just delete whatever you put on the workers after SLE15 is released. Whichever comes first.

Actions #17

Updated by szarate almost 6 years ago

@rpalethorpe: If you can change the jobs to use the alternate BIOS (which I've already tested) for this week, and no problems arise, then let's just do the switch.

Actions #18

Updated by rpalethorpe almost 6 years ago

  • Status changed from Workable to In Progress

Actions #19

Updated by rpalethorpe almost 6 years ago

  • Status changed from In Progress to Resolved

I don't see any boot failures on the most recent runs.
