action #39974

closed

[openqa][PARALLEL_WITH] Child job failure makes parent job terminated.

Added by xlai over 5 years ago. Updated over 5 years ago.

Status: Rejected
Priority: High
Assignee:
Category: -
Target version: -
Start date: 2018-08-20
Due date:
% Done: 0%
Estimated time:

Description

I have two jobs related by PARALLEL_WITH. When the child job finished as failed, the parent job received TERM shortly afterwards and did not run its remaining code, which makes it impossible to upload failure logs from the parent job.

Relationship of the two jobs: PARALLEL_WITH

Key code on parent job:
mutex_create('DST_READY_TO_START');    # after this, the child starts its core test code
wait_for_children;
# upload logs
script_run("xl dmesg > /tmp/xl-dmesg.log");    # got TERM here (see os-autoinst log); the following lines never ran
my $logs = "/var/log/libvirt /var/log/messages /var/log/xen /var/lib/xen/dump /tmp/xl-dmesg.log";
&virt_autotest_base::upload_virt_logs($logs, "guest-migration-dst-logs");

Key log:
CHILD JOB: http://10.67.18.220/tests/259, normally failed
PARENT: http://10.67.18.220/tests/258/file/autoinst-log.txt
PARENT KEY LOG:
[2018-08-17T18:40:18.0074 CST] [debug] Waiting for 1 jobs to finish
[2018-08-17T18:40:19.0096 CST] [debug] Waiting for 1 jobs to finish
[2018-08-17T18:40:20.0121 CST] [debug] Waiting for 0 jobs to finish
[2018-08-17T18:40:20.0121 CST] [debug] /var/lib/openqa/share/tests/sle-12-SP4/tests/virt_autotest/guest_migration_dst.pm:49 called testapi::script_run
[2018-08-17T18:40:20.0121 CST] [debug] <<< testapi::script_run(cmd='xl dmesg > /tmp/xl-dmesg.log', wait=undef)
[2018-08-17T18:40:20.0121 CST] [debug] /var/lib/openqa/share/tests/sle-12-SP4/tests/virt_autotest/guest_migration_dst.pm:49 called testapi::script_run
[2018-08-17T18:40:20.0122 CST] [debug] <<< testapi::type_string(string='xl dmesg > /tmp/xl-dmesg.log', max_interval=250, wait_screen_changes=0, wait_still_screen=0)
BYTES {"json_cmd_token":"kCadcRoy","type_string":{"max_interval":250,"text":"xl dmesg > /tmp/xl-dmesg.log","json_cmd_token":"EGgaJsza"}}
[2018-08-17T18:40:20.0615 CST] [debug] backend got TERM
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":39339"
after 2679 requests (2522 known processed) with 0 events remaining.
[2018-08-17T18:40:20.0617 CST] [info] Collected unknown process with pid 17212 and exit status: 1
[2018-08-17T18:40:20.0617 CST] [debug] autotest received signal TERM, saving results of current test before exiting
[2018-08-17T18:40:20.0618 CST] [debug] signalhandler got TERM - loop 1
[2018-08-17T18:40:20.0618 CST] [debug] awaiting death of commands process
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":60734"
after 2729 requests (2729 known processed) with 0 events remaining.
[2018-08-17T18:40:20.0624 CST] [debug] tests died
[2018-08-17T18:40:20.0624 CST] [info] Collected unknown process with pid 17309 and exit status: 1
[2018-08-17T18:40:20.0625 CST] [info] Collected unknown process with pid 17311 and exit status: 15
[2018-08-17T18:40:20.0625 CST] [info] Collected unknown process with pid 17313 and exit status: 0
[2018-08-17T18:40:20.0625 CST] [info] Collected unknown process with pid 17314 and exit status: 255
[2018-08-17T18:40:20.0626 CST] [debug] signalhandler got TERM - loop 0
[2018-08-17T18:40:20.0626 CST] [debug] killing backend process 16929
[2018-08-17T18:40:20.0626 CST] [info] Collected unknown process with pid 17214 and exit status: 15
[2018-08-17T18:40:20.0627 CST] [info] Collected unknown process with pid 17216 and exit status: 0
[2018-08-17T18:40:20.0970 CST] [info] Collected unknown process with pid 16930 and exit status: 0
[2018-08-17T18:40:20.0970 CST] [info] Collected unknown process with pid 16963 and exit status: 0
[2018-08-17T18:40:20.0970 CST] [info] Collected unknown process with pid 17076 and exit status: 0
[2018-08-17T18:40:20.0971 CST] [info] Collected unknown process with pid 17204 and exit status: 0
[2018-08-17T18:40:20.0971 CST] [info] Collected unknown process with pid 17301 and exit status: 0
[2018-08-17T18:40:20.0972 CST] [info] Collected unknown process with pid 20001 and exit status: 0
[2018-08-17T18:40:20.0975 CST] [debug] done with backend process
[2018-08-17T18:40:20.0982 CST] [info] Isotovideo exit status: 1
[2018-08-17T18:40:20.0983 CST] [info] +++ worker notes +++
[2018-08-17T18:40:20.0983 CST] [info] end time: 2018-08-17 10:40:20
[2018-08-17T18:40:20.0983 CST] [info] result: cancel

Actions #1

Updated by xlai over 5 years ago

This is blocking our guest migration integration into osd. I hope it can be fixed.

Actions #2

Updated by xlai over 5 years ago

openqa debug log

Actions #3

Updated by xlai over 5 years ago

@szarate, could you please help, or assign this to someone familiar with multi-machine jobs, to first confirm whether this is expected behavior on ipmi or an issue? According to wei, on qemu workers their migration jobs can successfully upload logs from the parent when the child fails. On ipmi, ours cannot.

Actions #4

Updated by szarate over 5 years ago

@alice, yes, we'll help and look at this. We haven't had the chance to actually look at it yet.

Actions #5

Updated by xlai over 5 years ago

szarate wrote:

@alice, yes, we'll help and look at this. We haven't had the chance to actually look at it yet.

Got it, thanks for the reply. Looking forward to the debugging result, and thanks for the help!

Actions #6

Updated by coolo over 5 years ago

  • Status changed from New to Rejected

Please read https://github.com/os-autoinst/openQA/blob/master/docs/WritingTests.asciidoc#job-dependencies - you are misinterpreting what a parent is. The parent is only needed until the child is dead (not in real life, just in openQA). If you want to synchronize them, you need to use the mmapi functions.
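One way this synchronization might look is sketched below. It is an illustration, not the code that was eventually used: the child is kept alive until the parent has uploaded its logs, so the parent is still "needed" while it runs its remaining code. The mutex names SRC_TEST_DONE and DST_LOGS_UPLOADED and the sub names parent_run and child_run are hypothetical, and the sketch assumes that lockapi's mutex_wait accepts the id of the job that created the mutex as its second argument and that mmapi's get_children returns a hash reference keyed by job id, as described in the openQA documentation. The fragments only run inside openQA test modules.

```perl
# Sketch only: fragments of two openQA test modules, not a standalone script.
# SRC_TEST_DONE, DST_LOGS_UPLOADED, parent_run and child_run are hypothetical
# names; in real test modules each body would live in that module's run().
use testapi;
use lockapi;
use mmapi;

# --- parent job (migration destination) ---
sub parent_run {
    mutex_create('DST_READY_TO_START');    # child starts its core test code

    # Wait until the child reports that its core test is over. The mutex is
    # created by the child, so the child's job id is passed as well.
    my $children = get_children();
    my ($child_id) = keys %$children;
    mutex_wait('SRC_TEST_DONE', $child_id);

    # The child is still running at this point, so the parent is not
    # terminated while it collects and uploads its logs.
    script_run("xl dmesg > /tmp/xl-dmesg.log");
    my $logs = "/var/log/libvirt /var/log/messages /var/log/xen /var/lib/xen/dump /tmp/xl-dmesg.log";
    virt_autotest_base::upload_virt_logs($logs, "guest-migration-dst-logs");

    mutex_create('DST_LOGS_UPLOADED');     # let the child finish
    wait_for_children;
}

# --- child job (migration source) ---
sub child_run {
    mutex_wait('DST_READY_TO_START');      # created by the parent

    # ... core test code ...

    mutex_create('SRC_TEST_DONE');         # tell the parent the test is over
    mutex_wait('DST_LOGS_UPLOADED');       # stay alive until logs are uploaded
}
```

Since the case reported here is a failing child, the last two calls in child_run would presumably also have to be made from the child's post_fail_hook. Otherwise the child dies before creating SRC_TEST_DONE and the parent is left waiting on it.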

Actions #7

Updated by xlai over 5 years ago

coolo wrote:

Please read https://github.com/os-autoinst/openQA/blob/master/docs/WritingTests.asciidoc#job-dependencies - you are misinterpreting what a parent is. The parent is only needed until the child is dead (not in real life, just in openQA). If you want to synchronize them, you need to use the mmapi functions.

Thanks for confirming it. I will change our code logic then.
