Project

General

Profile

Actions

action #60443

closed

job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong

Added by okurz over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2019-11-29
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.suse.de/tests/3641268 first failed due to a problem in tests vs. product behaviour but then the job incompleted somewhere in the middle of the post_fail_hook:

[2019-11-29T01:42:19.595 CET] [debug] <<< testapi::send_key(key='ret', do_wait=0, wait_screen_change=0)
[2019-11-29T01:42:19.996 CET] [debug] <<< testapi::wait_serial(record_output=undef, no_regex=0, regexp='Reached target Shutdown', quiet=undef, timeout=90, expect_not_found=0, buffer_size=undef)
[2019-11-29T01:43:51.136 CET] [debug] >>> testapi::wait_serial: Reached target Shutdown: fail
[2019-11-29T01:43:51.136 CET] [debug] <<< testapi::send_key(key='esc', wait_screen_change=0, do_wait=0)
[2019-11-29T01:43:51.647 CET] [debug] <<< testapi::send_key(key='spc', do_wait=0, wait_screen_change=0)
[2019-11-29T01:43:52.055 CET] [debug] [autotest] process exited: 0
[2019-11-29T01:43:52.056 CET] [debug] terminating command server 77732 because test execution ended
[2019-11-29T01:43:52.056 CET] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20033/WYckk7Uc8qxL6MAU/broadcast
[2019-11-29T01:43:53.066 CET] [debug] commands process exited: 0
[2019-11-29T01:43:54.066 CET] [debug] done with command server
[2019-11-29T01:43:54.066 CET] [debug] stopping autotest process 77744
[2019-11-29T01:43:54.066 CET] [debug] done with autotest process
[2019-11-29T01:43:54.066 CET] [debug] isotovideo failed
[2019-11-29T01:43:54.067 CET] [debug] stopping backend process 77745
[2019-11-29T01:43:55.068 CET] [debug] backend got TERM
[2019-11-29T01:43:55.068 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2019-11-29T01:43:58.068 CET] [debug] done with backend process
77728: EXIT 1
[2019-11-29T01:43:58.0106 CET] [info] [pid:95895] Isotovideo exit status: 1
[2019-11-29T01:44:01.0134 CET] [info] [pid:95895] +++ worker notes +++
[2019-11-29T01:44:01.0134 CET] [info] [pid:95895] End time: 2019-11-29 00:44:01
[2019-11-29T01:44:01.0134 CET] [info] [pid:95895] Result: died

It seems the VM did not reach shutdown in time which one can see in "testapi::wait_serial: Reached target Shutdown: fail" and after the next two key presses of "esc" and "spc" the autotest process exited and the rest follows.

Expected Result

Test should fail not incomplete


Related issues 5 (2 open3 closed)

Related to openQA Project - action #59926: test incompletes in middle of execution with auto_review:"Unexpected end of data 0":retry, system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]'New2019-11-17

Actions
Related to openQA Project - coordination #62420: [epic] Distinguish all types of incompletesResolvedokurz2018-12-12

Actions
Related to openQA Project - action #53999: openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions for cancelled jobsWorkable2019-07-09

Actions
Has duplicate openQA Project - action #56261: [tools] test fails in yast2_bootloader- isotovideo failed, test execuition interrupted on xenRejectedokurz2019-09-02

Actions
Has duplicate openQA Project - action #60200: "considering VNC stalled, no update for" job incompletes just after VNC stall detection, no clear errorRejectedokurz2019-11-22

Actions
Actions #1

Updated by okurz over 4 years ago

  • Subject changed from job incomplete with "isotovideo failed" but no further details what is wrong to job incomplete with auto_review:"(?s)isotovideo failed.*EXIT 1" but no further details what is wrong
Actions #2

Updated by okurz over 4 years ago

  • Subject changed from job incomplete with auto_review:"(?s)isotovideo failed.*EXIT 1" but no further details what is wrong to job incomplete with auto_review:"(?s).autotest. process exited: 0.*isotovideo failed.*EXIT 1" but no further details what is wrong

Restricting to regex to not match on #60161 cases, e.g. as in https://openqa.suse.de/tests/3772385/file/autoinst-log.txt where it says:

The console 'sut' is not responding (half-open socket?). Make sure the console is reachable or disable stall detection on expected disconnects with '$console->disable_vnc_stalls', for example in case of intended machine shutdown
[2020-01-09T16:22:59.576 CET] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2020-01-09T16:23:01.744 CET] [debug] flushing frames
[2020-01-09T16:23:01.782 CET] [debug] QEMU: QEMU emulator version 3.1.1.1 (openSUSE Leap 15.1)
[2020-01-09T16:23:01.782 CET] [debug] QEMU: Copyright (c) 2003-2018 Fabrice Bellard and the QEMU Project developers
[2020-01-09T16:23:01.782 CET] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 30026 (/usr/bin/isotovideo: backen)
[2020-01-09T16:23:01.783 CET] [debug] sending magic and exit
[2020-01-09T16:23:01.784 CET] [debug] received magic close
[2020-01-09T16:23:01.785 CET] [debug] THERE IS NOTHING TO READ 15 4 3
[2020-01-09T16:23:01.785 CET] [debug] terminating command server 29985 because test execution ended
[2020-01-09T16:23:01.785 CET] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20193/raztdqRUz9de9kg8/broadcast
[2020-01-09T16:23:01.800 CET] [debug] commands process exited: 0
[2020-01-09T16:23:01.800 CET] [debug] done with command server
[2020-01-09T16:23:01.800 CET] [debug] stopping autotest process 30006
[2020-01-09T16:23:01.805 CET] [debug] backend process exited: 0
[2020-01-09T16:23:01.808 CET] [debug] done with autotest process
[2020-01-09T16:23:01.808 CET] [debug] isotovideo failed
[2020-01-09T16:23:01.809 CET] [debug] stopping backend process 30026
[2020-01-09T16:23:01.809 CET] [debug] done with backend process
29979: EXIT 1

so I added the '[autotest] process exited: 0', maybe that's enough.

Actions #3

Updated by okurz over 4 years ago

  • Related to action #59926: test incompletes in middle of execution with auto_review:"Unexpected end of data 0":retry, system journal shows 'kernel: traps: /usr/bin/isotov[2300] general protection ip:7fd5ef11771e sp:7ffe066f2200 error:0 in libc-2.26.so[7fd5ef094000+1b1000]' added
Actions #4

Updated by okurz over 4 years ago

https://openqa.suse.de/tests/3772621/file/autoinst-log.txt was caught by auto-review but it does provide a bit more details:

[2020-01-09T18:04:19.617 CET] [debug] no match: 295.6s, best candidate: displaymanager-sles4sap-nousers-20170117 (0.28)
[2020-01-09T18:04:19.623 CET] [debug] no change: 291.6s
[2020-01-09T18:04:19.867 CET] [debug] considering VNC stalled, no update for 6.14 seconds
[2020-01-09T18:04:20.121 CET] [debug] pointer type 1 0 1024 768 -257
[2020-01-09T18:04:20.121 CET] [debug] led state 0 1 1 -261
[2020-01-09T18:04:20.618 CET] [debug] no change: 290.6s
[2020-01-09T18:04:21.620 CET] [debug] no change: 289.6s
/usr/bin/isotovideo: backen: malloc.c:3829: _int_malloc: Assertion `chunk_main_arena (bck->bk)' failed.
Unexpected end of data 0
[2020-01-09T18:04:22.640 CET] [debug] backend process exited: 0
[2020-01-09T18:04:22.641 CET] [debug] terminating command server 32299 because test execution ended
[2020-01-09T18:04:22.641 CET] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20153/0NP1yqsAinoUXvjN/broadcast
[2020-01-09T18:04:22.642 CET] [debug] Driver backend collected unknown process with pid 32378 and exit status: 1
[2020-01-09T18:04:22.656 CET] [debug] commands process exited: 0
[2020-01-09T18:04:22.656 CET] [debug] done with command server
[2020-01-09T18:04:22.657 CET] [debug] stopping autotest process 32314
[2020-01-09T18:04:22.657 CET] [debug] autotest received signal TERM, saving results of current test before exiting
[2020-01-09T18:04:23.243 CET] [debug] [autotest] process exited: 0
[2020-01-09T18:04:23.243 CET] [debug] done with autotest process
[2020-01-09T18:04:23.243 CET] [debug] isotovideo failed
[2020-01-09T18:04:23.244 CET] [debug] stopping backend process 32346
[2020-01-09T18:04:23.244 CET] [debug] done with backend process
32296: EXIT 1

see the line

/usr/bin/isotovideo: backen: malloc.c:3829: _int_malloc: Assertion `chunk_main_arena (bck->bk)' failed.

maybe the underlying issue is actually the same as #59926 , that isotovideo just crashes, sometimes with a timely message, sometimes silent.

Actions #5

Updated by okurz over 4 years ago

  • Has duplicate action #56261: [tools] test fails in yast2_bootloader- isotovideo failed, test execuition interrupted on xen added
Actions #6

Updated by okurz about 4 years ago

Actions #7

Updated by okurz about 4 years ago

  • Has duplicate action #60200: "considering VNC stalled, no update for" job incompletes just after VNC stall detection, no clear error added
Actions #8

Updated by okurz about 4 years ago

https://openqa.suse.de/tests/3850098/file/autoinst-log.txt looks like the same to me but it does not have ".autotest. process exited: 0".

system journal on the system shows:

Jan 31 02:30:02 openqaworker3 kernel: perl[30128]: segfault at 514 ip 0000563252eb16a0 sp 00007f72fb89d830 error 4 in perl[563252df1000+1f9000]
Jan 31 02:30:02 openqaworker3 kernel: perl[30129]: segfault at 514 ip 0000563252eb16a0 sp 00007f72fb09c830 error 4
Jan 31 02:30:02 openqaworker3 kernel: perl[30130]: segfault at 514 ip 0000563252eb16a0 sp 00007f72fa89b830 error 4 in perl[563252df1000+1f9000]
Jan 31 02:30:02 openqaworker3 kernel: perl[30131]: segfault at 514 ip 0000563252eb16a0 sp 00007f72fa09a830 error 4
Jan 31 02:30:02 openqaworker3 kernel:  in perl[563252df1000+1f9000]
Jan 31 02:30:02 openqaworker3 kernel:  in perl[563252df1000+1f9000]
Jan 31 02:30:02 openqaworker3 systemd-coredump[30520]: Failed to get EXE, ignoring: No such process
Jan 31 02:30:02 openqaworker3 systemd[1]: Started Process Core Dump (PID 30520/UID 0).
Jan 31 02:30:02 openqaworker3 systemd-coredump[30521]: Failed to compress (unnamed temporary file): Invalid argument
Jan 31 02:30:02 openqaworker3 systemd-coredump[30521]: Failed to generate stack trace: invalid `Elf' handle
Jan 31 02:30:02 openqaworker3 systemd-coredump[30521]: Process 30127 (/usr/bin/isotov) of user 489 dumped core.
Jan 31 02:30:02 openqaworker3 worker[28728]: [debug] [pid:30490] Uploading artefact yast2_nfs_server-117.png as f0023ce600fba82ebe50d0a7e9486532
Jan 31 02:30:03 openqaworker3 worker[28722]: [debug] [pid:28722] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3850089/status
Jan 31 02:30:03 openqaworker3 worker[28177]: [debug] [pid:28177] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3849157/status
Jan 31 02:30:03 openqaworker3 systemd-coredump[30493]: Process 30143 (/usr/bin/isotov) of user 489 dumped core.

                                                       Stack trace of thread 30261:
                                                       #0  0x0000563252eb16a0 Perl_csighandler (perl)
                                                       #1  0x00007f7307b1f300 __restore_rt (libpthread.so.0)
                                                       #2  0x00007f7307846269 syscall (libc.so.6)

so it looks like it's actually the same as #59926

Actions #9

Updated by okurz about 4 years ago

  • Subject changed from job incomplete with auto_review:"(?s).autotest. process exited: 0.*isotovideo failed.*EXIT 1" but no further details what is wrong to job incomplete with auto_review:"(?s)process exited: 0.*isotovideo failed.*EXIT 1" but no further details what is wrong

I guess we can live with a bit more greedy regex.

Actions #10

Updated by cfconrad about 4 years ago

  • Status changed from New to In Progress
  • Assignee set to cfconrad

This may be the same issue as we have in Wicked https://openqa.suse.de/tests/3925667

I'm not able to reproduce such issue in my own test environment. So I asked to get two worker instances on one of our workers.
I agreed with okurz to get two like this:

systemctl disable --now salt-minion openqa-worker@{23,24}
Actions #11

Updated by cfconrad about 4 years ago

  • Status changed from In Progress to Workable
  • Assignee deleted (cfconrad)

Duo to other tasks I reenabled the worker instances again. Will follow up on this later.

sudo systemctl enable --now salt-minion openqa-worker@{23,24}
sudo systemctl start --now salt-minion openqa-worker@{23,24}
Actions #12

Updated by okurz about 4 years ago

cfconrad wrote:

Duo to other tasks I reenabled the worker instances again. Will follow up on this later.

sudo systemctl enable --now salt-minion openqa-worker@{23,24}
sudo systemctl start --now salt-minion openqa-worker@{23,24}

Just a small hint: enable --now already means "enable and start".

Actions #13

Updated by okurz about 4 years ago

  • Subject changed from job incomplete with auto_review:"(?s)process exited: 0.*isotovideo failed.*EXIT 1" but no further details what is wrong to job incomplete with auto_review:"(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong
Actions #14

Updated by okurz about 4 years ago

recent example: https://openqa.opensuse.org/tests/1219514/file/autoinst-log.txt

[2020-03-31T04:23:41.093 CEST] [debug] tests/installation/boot_encrypt.pm:26 called utils::unlock_if_encrypted -> lib/utils.pm:320 called testapi::send_key
[2020-03-31T04:23:41.093 CEST] [debug] <<< testapi::send_key(key="ret", do_wait=0, wait_screen_change=0)
[2020-03-31T04:23:41.365 CEST] [debug] ||| finished boot_encrypt installation at 2020-03-31 02:23:41 (16 s)
[2020-03-31T04:23:41.365 CEST] [debug] ||| starting first_boot tests/installation/first_boot.pm
[2020-03-31T04:23:41.366 CEST] [debug] tests/installation/first_boot.pm:25 called opensusebasetest::wait_boot_past_bootloader -> lib/opensusebasetest.pm:898 called opensusebasetest::handle_displaymanager_login -> lib/opensusebasetest.pm:731 called testapi::assert_screen
[2020-03-31T04:23:41.366 CEST] [debug] <<< testapi::assert_screen(mustmatch=[
  "displaymanager",
  "emergency-shell",
  "emergency-mode"
], timeout=300)
[2020-03-31T04:23:42.674 CEST] [debug] WARNING: check_asserted_screen took 0.67 seconds for 39 candidate needles - make your needles more specific
[2020-03-31T04:23:42.674 CEST] [debug] no match: 299.3s, best candidate: displaymanager-gdm-20170703 (0.17)
[2020-03-31T04:23:43.003 CEST] [debug] no change: 298.3s
[2020-03-31T04:23:44.004 CEST] [debug] no change: 297.3s
[2020-03-31T04:23:45.005 CEST] [debug] no change: 296.3s
[2020-03-31T04:23:50.214 CEST] [debug] WARNING: check_asserted_screen took 4.21 seconds for 39 candidate needles - make your needles more specific
[2020-03-31T04:23:50.214 CEST] [debug] no match: 295.3s, best candidate: displaymanager-gdm-20170703 (0.17)
[2020-03-31T04:23:50.222 CEST] [debug] no change: 291.1s
[2020-03-31T04:23:50.465 CEST] [debug] considering VNC stalled, no update for 4.79 seconds
[2020-03-31T04:23:50.719 CEST] [debug] pointer type 1 0 1024 768 -257
[2020-03-31T04:23:50.719 CEST] [debug] led state 0 1 1 -261
[2020-03-31T04:23:51.217 CEST] [debug] no change: 290.1s
[2020-03-31T04:23:52.217 CEST] [debug] no change: 289.1s
[2020-03-31T04:23:53.278 CEST] [debug] backend process exited: 0
[2020-03-31T04:23:53.278 CEST] [debug] terminating command server 6427 because test execution ended
[2020-03-31T04:23:53.278 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/YvCTf_zBVTT1rF2g/broadcast
[2020-03-31T04:23:53.289 CEST] [debug] commands process exited: 0
[2020-03-31T04:23:53.289 CEST] [debug] done with command server
[2020-03-31T04:23:53.289 CEST] [debug] stopping autotest process 6516
[2020-03-31T04:23:53.289 CEST] [debug] autotest received signal TERM, saving results of current test before exiting
[2020-03-31T04:23:53.295 CEST] [debug] [autotest] process exited: 0
[2020-03-31T04:23:53.295 CEST] [debug] done with autotest process
[2020-03-31T04:23:53.295 CEST] [debug] isotovideo failed
[2020-03-31T04:23:53.296 CEST] [debug] stopping backend process 6532
[2020-03-31T04:23:53.296 CEST] [debug] done with backend process
6421: EXIT 1
[2020-03-31T04:23:53.0345 CEST] [info] Isotovideo exit status: 1
[2020-03-31T04:23:56.0381 CEST] [info] +++ worker notes +++
[2020-03-31T04:23:56.0381 CEST] [info] End time: 2020-03-31 02:23:56
[2020-03-31T04:23:56.0381 CEST] [info] Result: died: terminated prematurely, see log output for details
Actions #15

Updated by okurz almost 4 years ago

  • Related to action #53999: openqa-worker (isotovideo) dumps core / segfaults quite often on several workers and distributions for cancelled jobs added
Actions #16

Updated by asmorodskyi almost 4 years ago

#66619 - cross posting here too , there might be some connection between issues

Actions #17

Updated by okurz almost 4 years ago

I do not see a helpful relation to that ticket.

https://openqa.opensuse.org/tests/1268572/file/autoinst-log.txt is a recent example:

[2020-05-16T18:34:07.132 CEST] [debug] >>> testapi::_check_backend_response: match=emergency-mode,emergency-shell,generic-desktop,opensuse-welcome timed out after 30 (check_screen)
[2020-05-16T18:34:07.155 CEST] [debug] tests/installation/setup_zdup.pm:24 called opensusebasetest::wait_boot -> lib/opensusebasetest.pm:1003 called opensusebasetest::wait_boot_past_bootloader -> lib/opensusebasetest.pm:932 called testapi::check_screen
[2020-05-16T18:34:07.155 CEST] [debug] <<< testapi::check_screen(mustmatch=[
  "generic-desktop",
  "emergency-shell",
  "emergency-mode",
  "opensuse-welcome"
], timeout=30)
[2020-05-16T18:34:08.767 CEST] [debug] WARNING: check_asserted_screen took 1.61 seconds for 37 candidate needles - make your needles more specific
[2020-05-16T18:34:08.768 CEST] [debug] no match: 29.9s, best candidate: generic-desktop-boo1170586_welcome_in_taskbar_but_no_window-20200503 (0.00)
[2020-05-16T18:34:08.775 CEST] [debug] no change: 28.3s
…
[2020-05-16T18:34:31.161 CEST] [debug] no change: 5.9s
[2020-05-16T18:34:31.662 CEST] [debug] considering VNC stalled, no update for 4.00 seconds
[2020-05-16T18:34:31.913 CEST] [debug] pointer type 1 0 1024 768 -257
[2020-05-16T18:34:31.914 CEST] [debug] led state 0 1 1 -261
[2020-05-16T18:34:32.162 CEST] [debug] no change: 4.9s
[2020-05-16T18:34:34.880 CEST] [debug] WARNING: check_asserted_screen took 1.72 seconds for 37 candidate needles - make your needles more specific
[2020-05-16T18:34:34.880 CEST] [debug] no match: 3.9s, best candidate: generic-desktop-boo1170586_welcome_in_taskbar_but_no_window-20200503 (0.00)
[2020-05-16T18:34:34.896 CEST] [debug] no change: 2.2s
[2020-05-16T18:34:37.626 CEST] [debug] WARNING: check_asserted_screen took 1.73 seconds for 37 candidate needles - make your needles more specific
[2020-05-16T18:34:37.626 CEST] [debug] no match: 1.2s, best candidate: generic-desktop-boo1170586_welcome_in_taskbar_but_no_window-20200503 (0.00)
[2020-05-16T18:34:39.530 CEST] [debug] WARNING: check_asserted_screen took 1.90 seconds for 37 candidate needles - make your needles more specific
[2020-05-16T18:34:39.530 CEST] [debug] no match: -0.5s, best candidate: generic-desktop-boo1170586_welcome_in_taskbar_but_no_window-20200503 (0.00)
[2020-05-16T18:34:39.578 CEST] [debug] backend process exited: 0
[2020-05-16T18:34:39.579 CEST] [debug] terminating command server 15026 because test execution ended
…
15021: EXIT 1

Is maybe the last, negative timeout value related or the "considering VNC stalled"?

Actions #18

Updated by okurz almost 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15_pscc_lp-basesys-srv-desk-dev-contm-lgm-wsm_all_full
https://openqa.suse.de/tests/4297153

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #19

Updated by openqa_review almost 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles12sp5_pscc_sdk-tcm-wsm_all_full
https://openqa.suse.de/tests/4328541

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #20

Updated by okurz almost 4 years ago

  • Target version set to Ready

this should be crosschecked against the current state as we have better reporting with the incomplete "reason". Also check current state within https://gitlab.suse.de/openqa/auto-review/pipelines

Actions #21

Updated by okurz over 3 years ago

  • Subject changed from job incomplete with auto_review:"(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong to job incomplete with "(?s)process exited: 0.*isotovideo failed.*EXIT 1":retry but no further details what is wrong
  • Status changed from Workable to Resolved
  • Assignee set to okurz

situation looks better. Because of the rather generic catch-expression https://gitlab.suse.de/openqa/auto-review still referenced this ticket and e.g. SQL query select jobs.id,t_finished,test,reason,result from jobs,comments where result = 'incomplete' and t_finished >= '2020-07-20' and comments.job_id = jobs.id and text ~ 'poo#60443'; shows some but they are for another reason, e.g. https://openqa.opensuse.org/tests/1342177# with more details and linked to video encoder config failures.

Actions

Also available in: Atom PDF