action #104077
closedbackend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M
Description
Observation¶
https://openqa.suse.de/tests/7857824/logfile?filename=autoinst-log.txt shows
[2021-12-15T16:37:40.022513+01:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985
backend::qemu::syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), "{\"execute\":\"query-status\"}\x{a}") called at /usr/lib/os-autoinst/backend/qemu.pm line 985
backend::qemu::handle_qmp_command(backend::qemu=HASH(0x558d9f2f1220), HASH(0x558d9ed9f280), "fatal", 1) called at /usr/lib/os-autoinst/backend/qemu.pm line 204
backend::qemu::_wait_while_status_is(backend::qemu=HASH(0x558d9f2f1220), qr(migrate)u, 300, "Timed out while loading snapshot") called at /usr/lib/os-autoinst/backend/qemu.pm line 459
backend::qemu::load_snapshot(backend::qemu=HASH(0x558d9f2f1220), HASH(0x558d9dd8ba68)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 74
backend::baseclass::handle_command(backend::qemu=HASH(0x558d9f2f1220), HASH(0x558d9dd86b58)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 545
backend::baseclass::check_socket(backend::qemu=HASH(0x558d9f2f1220), IO::Handle=GLOB(0x558d9f25b780)) called at /usr/lib/os-autoinst/backend/qemu.pm line 1060
backend::qemu::check_socket(backend::qemu=HASH(0x558d9f2f1220), IO::Handle=GLOB(0x558d9f25b780), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 236
backend::baseclass::do_capture(backend::qemu=HASH(0x558d9f2f1220), undef, 1639581624.84715) called at /usr/lib/os-autoinst/backend/baseclass.pm line 265
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 265
backend::baseclass::run_capture_loop(backend::qemu=HASH(0x558d9f2f1220)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 128
backend::baseclass::run(backend::qemu=HASH(0x558d9f2f1220), 13, 16) called at /usr/lib/os-autoinst/backend/driver.pm line 70
backend::driver::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x558d9f8f8040)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x558d9f8f8040), CODE(0x558d9eb1e900)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 488
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x558d9f8f8040)) called at /usr/lib/os-autoinst/backend/driver.pm line 71
backend::driver::start(backend::driver=HASH(0x558d9f2b1ef8)) called at /usr/lib/os-autoinst/backend/driver.pm line 38
backend::driver::new("backend::driver", "qemu") called at /usr/bin/isotovideo line 211
main::init_backend() called at /usr/bin/isotovideo line 262
[2021-12-15T16:37:40.022856+01:00] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
A cloned job https://openqa.suse.de/tests/7863925 does not immediately reproduce this problem
Reproducible¶
So far only a single occurence was found
Acceptance criteria¶
- AC1: We at least improved the code a little bit
Problem¶
We see a not so helpful stack trace.
Suggestions¶
- Look at the exact code section and take a look what could go wrong there, improve test code coverage, logging and error handling.
- Don't try to fix a problem that we can't reproduce :)
- Switch over to transactional upgrades
Updated by okurz about 3 years ago
- Priority changed from Normal to High
- Target version set to Ready
Hi GraceWang, did you encounter this just once or multiple times? Can it be reproduced?
Updated by GraceWang about 3 years ago
okurz wrote:
Hi GraceWang, did you encounter this just once or multiple times? Can it be reproduced?
This is the first time I see this error. I also searched and didn't find a poo with the same error.
Not sure if this can be reproduced by restarting this job. Let me try.
Updated by osukup about 3 years ago
Cause of this problem is most probadly upgrade of qemu during job run
from logs:
start of job:
[2021-12-15T16:20:27.104592+01:00] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-12-15T16:20:27.104666+01:00] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-12-15T16:20:27.108104+01:00] [debug] Snapshots are supported
Part where job failed:
[2021-12-15T16:37:40.022856+01:00] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-12-15T16:37:40.023775+01:00] [debug] Passing remaining frames to the video encoder
[2021-12-15T16:37:40.061298+01:00] [debug] Waiting for video encoder to finalize the video
[2021-12-15T16:37:40.061370+01:00] [debug] The built-in video encoder (pid 12347) terminated
[2021-12-15T16:37:40.061559+01:00] [debug] QEMU: QEMU emulator version 5.2.0 (SUSE Linux Enterprise 15)
[2021-12-15T16:37:40.061608+01:00] [debug] QEMU: Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
[2021-12-15T16:37:40.061672+01:00] [warn] !!! : qemu-system-x86_64: Machine type received is 'pc-i440fx-4.2' and local is 'pc-i440fx-5.2'
[2021-12-15T16:37:40.061846+01:00] [debug] sending magic and exit
[2021-12-15T16:37:40.062292+01:00] [debug] received magic close
[2021-12-15T16:37:40.063180+01:00] [debug] THERE IS NOTHING TO READ 15 4 3
[2021-12-15T16:37:40.063457+01:00] [debug] stopping command server 12196 because test execution ended
so upgrade of system/ update of quemu from 4.2 to 5.2.0 is most likely cause
Updated by mkittler about 3 years ago
- Subject changed from backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 to backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz almost 3 years ago
- Related to action #99192: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M added
Updated by okurz almost 3 years ago
- Status changed from Workable to Resolved
- Assignee set to okurz
@osukup good observation. So we had a "lucky" case of a job that ran exactly during the time when the OSD worker upgrade from Leap 15.2 to 15.3 was conducted. It is unlikely to happen again in the same way anytime soon so the impact is low. To properly prevent such situation I see the following three ways when conducting upgrades:
- Shut down all monitoring and services during upgrade until after the upgrade and slowly bring back up systems in a consistent state. Likely more work than upgrading and repairing where it breaks especially because in most cases the automatic cloning on incomplete openQA jobs already takes place
- Use transactional-update which can also work with r/w root partitions to ensure proper root file system upgrades with reboots. Still relying on btrfs though which we don't use on all machines so this is not generally applicable in our current setup of installations
- Re-deploy the complete OSD infrastructure on a container management system, e.g. a Rancher controlled Kubernetes cluster with our workloads running in container
We discussed in the daily 2021-12-21 what solution we are looking for or what other minor things we can improve but we could not easily find a way forward which is properly aligned with the ticket so I think it's easier to just call this "Resolved" as the root cause was properly identified and fixed by completing the upgrades on OSD workers with #99192