Project

General

Profile

Actions

action #15378

closed

ppc worker died somewhere in the middle

Added by okurz over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Immediate
Assignee:
-
Category:
Regressions/Crashes
Target version:
-
Start date:
2016-12-08
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)

Description

Observation

openQA test in scenario sle-12-SP3-Server-DVD-ppc64le-om_proxyscc_sles12sp2_sdk_full_update_by_yast_ppc@ppc64le fails in
bootloader_ofw to show any content from the virtual machine. It looks like the virtual machine never started or was already shutdown at this time.

https://openqa.suse.de/tests/659770/file/autoinst-log.txt does not show anything obvious.

logs from worker on malbec at that time:

Dec 08 09:34:30 malbec worker[162548]: setting job 659232 to done
Dec 08 09:34:30 malbec worker[162548]: got job 659770: 00659770-sle-12-SP3-Server-DVD-ppc64le-Build0185-om_proxyscc_sles12sp2_sdk_full_update_by_yast_ppc@ppc64le
Dec 08 09:34:30 malbec worker[162548]: 47042: WORKING 659770
Dec 08 09:35:16 malbec worker[162548]: can't open /var/lib/openqa/pool/3/testresults/result-online_migration_setup.json: No such file or directory at /usr/share/openqa/script/../lib/OpenQA/Worker/Jobs.pm line 626.
Dec 08 09:35:21 malbec worker[162548]: cleaning up 00659770-sle-12-SP3-Server-DVD-ppc64le-Build0185-om_proxyscc_sles12sp2_sdk_full_update_by_yast_ppc@ppc64le...
Dec 08 09:35:21 malbec worker[162548]: Received command livelog_stop for job 659770, but we do not have any assigned. Ignoring!
Dec 08 09:36:48 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:37:21 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:37:38 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:38:50 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:39:18 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:41:18 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:48:40 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:48:55 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:49:33 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:50:14 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:50:27 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:51:10 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:56:35 malbec worker[162548]: Use of uninitialized value $op in numeric eq (==) at /usr/lib/perl5/vendor_perl/5.18.2/Mojo/Transaction/WebSocket.pm line 110.
Dec 08 09:56:35 malbec worker[162548]: setting job 659770 to done
Dec 08 09:56:35 malbec worker[162548]: got job 659814: 00659814-sle-12-SP2-Server-DVD-Incidents-ppc64le-Build:125282.perl-DBD-mysql.3690-qam-minimal-full@ppc64le
Dec 08 09:56:35 malbec worker[162548]: 47727: WORKING 659814

Problem

H1. we might not have enough logging output from the worker to understand what could have happened here.
H2. worker dead detection?

Further details

Always latest result in this scenario: latest


Subtasks 1 (0 open1 closed)

openQA Tests - action #15396: malbec unusable (see parent ticket)Resolvedmgriessmeier2016-12-08

Actions
Actions #1

Updated by okurz over 7 years ago

  • Priority changed from Normal to High

It's not just happening once: https://openqa.suse.de/tests/658864#step/bootloader_ofw/2 and https://openqa.suse.de/tests/658866#step/bootloader_ofw/2

The job should probably be reported as incomplete, not "failed". why is that?

Actions #2

Updated by okurz over 7 years ago

  • Priority changed from High to Urgent

And it's reproducible a lot: https://openqa.suse.de/tests/659899

Actions #3

Updated by mkittler over 7 years ago

  • Subject changed from ppc worker did somewhere in the middle to ppc worker died somewhere in the middle
Actions #4

Updated by mkittler over 7 years ago

to H2: Then we would receive warnings like https://progress.opensuse.org/issues/13952 in web UI log.

Actions #5

Updated by okurz over 7 years ago

  • Status changed from New to In Progress

mkittler wrote:

to H2: Then we would receive warnings like https://progress.opensuse.org/issues/13952 in web UI log.

Who said we don't have? I did not look for the logs from the web ui at around this time but I did now and could not find anything wrong regarding job 659770.

@mgriessmeier can you help to better understand why we did not see the issue in the logs properly as you found it in #15396?

Actions #6

Updated by coolo over 7 years ago

  • Status changed from In Progress to Closed

"match=bootloader-ofw timed out after 15" doesn't sound like 'in the middle'. As this immediate issue is now untouched for 10 days, I assume the panic faded off and we can close this

Actions #7

Updated by okurz over 7 years ago

  • Status changed from Closed to In Progress

The child ticket #15396 was more about the "immediate" panic but this ticket still should be open (can't change prio right now?!?). I was saying "in the middle" because os-autoinst was just acting super-stupid here. It's like if some DAU calls you that he can not see anything on his screen and after some debugging over the phone it turns out that the display was stolen. In all the tests failing linked to this ticket the problem was that the firmware was not loadable at all and the qemu instance did not display anything. I guess this can still be improved.

Actions #8

Updated by coolo over 7 years ago

  • Status changed from In Progress to Closed

then please create a new low priority issue - immediate priority issues are locked to it.

Actions

Also available in: Atom PDF