action #104077 (closed)

backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M

Added by GraceWang over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: Regressions/Crashes
Target version: Ready
Start date: 2021-12-16
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://openqa.suse.de/tests/7857824/logfile?filename=autoinst-log.txt shows

[2021-12-15T16:37:40.022513+01:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985
      backend::qemu::syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), "{\"execute\":\"query-status\"}\x{a}") called at /usr/lib/os-autoinst/backend/qemu.pm line 985
      backend::qemu::handle_qmp_command(backend::qemu=HASH(0x558d9f2f1220), HASH(0x558d9ed9f280), "fatal", 1) called at /usr/lib/os-autoinst/backend/qemu.pm line 204
      backend::qemu::_wait_while_status_is(backend::qemu=HASH(0x558d9f2f1220), qr(migrate)u, 300, "Timed out while loading snapshot") called at /usr/lib/os-autoinst/backend/qemu.pm line 459
      backend::qemu::load_snapshot(backend::qemu=HASH(0x558d9f2f1220), HASH(0x558d9dd8ba68)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 74
      backend::baseclass::handle_command(backend::qemu=HASH(0x558d9f2f1220), HASH(0x558d9dd86b58)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 545
      backend::baseclass::check_socket(backend::qemu=HASH(0x558d9f2f1220), IO::Handle=GLOB(0x558d9f25b780)) called at /usr/lib/os-autoinst/backend/qemu.pm line 1060
      backend::qemu::check_socket(backend::qemu=HASH(0x558d9f2f1220), IO::Handle=GLOB(0x558d9f25b780), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 236
      backend::baseclass::do_capture(backend::qemu=HASH(0x558d9f2f1220), undef, 1639581624.84715) called at /usr/lib/os-autoinst/backend/baseclass.pm line 265
      eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 265
      backend::baseclass::run_capture_loop(backend::qemu=HASH(0x558d9f2f1220)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 128
      backend::baseclass::run(backend::qemu=HASH(0x558d9f2f1220), 13, 16) called at /usr/lib/os-autoinst/backend/driver.pm line 70
      backend::driver::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x558d9f8f8040)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
      eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
      Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x558d9f8f8040), CODE(0x558d9eb1e900)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 488
      Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x558d9f8f8040)) called at /usr/lib/os-autoinst/backend/driver.pm line 71
      backend::driver::start(backend::driver=HASH(0x558d9f2b1ef8)) called at /usr/lib/os-autoinst/backend/driver.pm line 38
      backend::driver::new("backend::driver", "qemu") called at /usr/bin/isotovideo line 211
      main::init_backend() called at /usr/bin/isotovideo line 262
[2021-12-15T16:37:40.022856+01:00] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json

A cloned job https://openqa.suse.de/tests/7863925 does not immediately reproduce this problem.
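For context, the failing call chain boils down to the backend polling QEMU's status over its QMP UNIX socket while a snapshot is being loaded. The following is a minimal sketch of that pattern, illustrative only and not the actual os-autoinst code; the socket path is made up. The point it shows: once the QEMU process on the other end has exited, the next write to the socket fails with exactly the "Broken pipe" error quoted above.

#!/usr/bin/perl
# Minimal illustration (hypothetical, not os-autoinst code): polling QEMU's
# status over a QMP UNIX socket while waiting for a migration (snapshot load)
# to finish.
use strict;
use warnings;
use IO::Socket::UNIX;
use Socket qw(SOCK_STREAM);

$SIG{PIPE} = 'IGNORE';    # make syswrite return undef/EPIPE instead of killing the process

# Hypothetical socket path; os-autoinst manages its own per-job QMP socket.
my $qmp = IO::Socket::UNIX->new(Peer => '/tmp/qmp.sock', Type => SOCK_STREAM)
    or die "cannot connect to QMP socket: $!";

for (1 .. 300) {
    # If the QEMU process on the other end is gone (here: it exited after the
    # machine-type mismatch following the package upgrade), the peer side of
    # the socket is closed and this write fails with "Broken pipe".
    defined syswrite($qmp, qq({"execute":"query-status"}\n))
        or die "Can't syswrite QMP command: $!";
    sysread($qmp, my $answer, 4096)
        or die "Can't read QMP answer: $!";
    last unless $answer =~ /migrate/;    # stop waiting once QEMU left the migrate state
    sleep 1;
}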

Reproducible

So far only a single occurrence has been found.

Acceptance criteria

  • AC1: We at least improved the code a little bit

Problem

We see a stack trace that is not very helpful.

Suggestions

  • Look at the exact code section, consider what could go wrong there, and improve test code coverage, logging and error handling (see the sketch after this list)
  • Don't try to fix a problem that we can't reproduce :)
  • Switch over to transactional upgrades
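Regarding the error-handling suggestion above, here is a rough sketch of what "improve logging and error handling" could look like around the failing write. It is a sketch only; `send_qmp` and the process-liveness check are hypothetical, not existing backend API.

# Hypothetical helper (not existing os-autoinst API): write a QMP command and
# turn a low-level EPIPE into a message that says what actually happened.
use strict;
use warnings;
use Errno qw(EPIPE);

sub send_qmp {
    my ($socket, $cmd, $qemu_pid) = @_;
    local $SIG{PIPE} = 'IGNORE';    # get EPIPE from syswrite instead of dying on SIGPIPE
    return 1 if defined syswrite($socket, "$cmd\n");

    my $reason = "$!";
    if ($! == EPIPE && defined $qemu_pid && !kill(0, $qemu_pid)) {
        # kill 0 only probes the process; if it is gone, say so explicitly
        $reason .= " (QEMU process $qemu_pid is no longer running - was it upgraded or killed during the job?)";
    }
    die "Can't send QMP command '$cmd': $reason\n";
}

With something along these lines the log would have stated directly that QEMU was gone, instead of only showing a broken pipe deep inside handle_qmp_command.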

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #99192: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M (Resolved, livdywan)

Actions #1

Updated by okurz over 2 years ago

  • Priority changed from Normal to High
  • Target version set to Ready

Hi GraceWang, did you encounter this just once or multiple times? Can it be reproduced?

Actions #2

Updated by GraceWang over 2 years ago

okurz wrote:

Hi GraceWang, did you encounter this just once or multiple times? Can it be reproduced?

This is the first time I have seen this error. I also searched and didn't find a poo with the same error.
I'm not sure if this can be reproduced by restarting this job. Let me try.

Actions #3

Updated by mkittler over 2 years ago

  • Description updated (diff)
Actions #4

Updated by osukup over 2 years ago

The cause of this problem is most probably an upgrade of QEMU during the job run.

From the logs:

Start of the job:

[2021-12-15T16:20:27.104592+01:00] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-12-15T16:20:27.104666+01:00] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-12-15T16:20:27.108104+01:00] [debug] Snapshots are supported

The part where the job failed:

[2021-12-15T16:37:40.022856+01:00] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-12-15T16:37:40.023775+01:00] [debug] Passing remaining frames to the video encoder
[2021-12-15T16:37:40.061298+01:00] [debug] Waiting for video encoder to finalize the video
[2021-12-15T16:37:40.061370+01:00] [debug] The built-in video encoder (pid 12347) terminated
[2021-12-15T16:37:40.061559+01:00] [debug] QEMU: QEMU emulator version 5.2.0 (SUSE Linux Enterprise 15)
[2021-12-15T16:37:40.061608+01:00] [debug] QEMU: Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
[2021-12-15T16:37:40.061672+01:00] [warn] !!! : qemu-system-x86_64: Machine type received is 'pc-i440fx-4.2' and local is 'pc-i440fx-5.2'
[2021-12-15T16:37:40.061846+01:00] [debug] sending magic and exit
[2021-12-15T16:37:40.062292+01:00] [debug] received magic close
[2021-12-15T16:37:40.063180+01:00] [debug] THERE IS NOTHING TO READ 15 4 3
[2021-12-15T16:37:40.063457+01:00] [debug] stopping command server 12196 because test execution ended

So a system upgrade / update of QEMU from 4.2.1 to 5.2.0 is the most likely cause.
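A hedged idea following from this analysis (purely illustrative; nothing like this exists in the backend): record the QEMU version at job start and compare it again right before loading a snapshot, so that a mid-run package update produces a clear error instead of a broken pipe deep in the QMP code.

# Illustrative sketch (not existing backend code): detect a QEMU binary that
# changed underneath a running job before trying to load a snapshot.
use strict;
use warnings;

# Hypothetical helper: ask the installed QEMU binary for its version string.
sub qemu_version {
    my ($binary) = @_;
    my ($line) = qx{$binary --version};   # e.g. "QEMU emulator version 5.2.0 ..."
    return $line =~ /version\s+(\S+)/ ? $1 : 'unknown';
}

my $binary           = 'qemu-system-x86_64';
my $version_at_start = qemu_version($binary);   # would be stored when the job starts

# ... job runs, possibly for a long time ...

my $version_now = qemu_version($binary);
die "QEMU changed from $version_at_start to $version_now during the job; "
  . "refusing to load the snapshot (machine types will not match)\n"
  if $version_now ne $version_at_start;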

Actions #5

Updated by mkittler over 2 years ago

  • Subject changed from backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 to backend died: Can't syswrite(IO::Socket::UNIX=GLOB(0x558d9dd5cb68), <BUFFER>): Broken pipe at /usr/lib/os-autoinst/backend/qemu.pm line 985 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by livdywan over 2 years ago

  • Description updated (diff)
Actions #7

Updated by okurz over 2 years ago

  • Related to action #99192: Upgrade osd workers and openqa-monitor to openSUSE Leap 15.3 size:M added
Actions #8

Updated by okurz over 2 years ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz

@osukup good observation. So we had a "lucky" case of a job that ran exactly during the time when the OSD worker upgrade from Leap 15.2 to 15.3 was conducted. It is unlikely to happen again in the same way anytime soon, so the impact is low. To properly prevent such a situation I see the following three options when conducting upgrades:

  1. Shut down all monitoring and services during the upgrade and slowly bring systems back up in a consistent state afterwards. This is likely more work than upgrading and repairing wherever it breaks, especially because in most cases the automatic cloning of incomplete openQA jobs already takes place
  2. Use transactional-update, which can also work with r/w root partitions, to ensure proper root file system upgrades with reboots. This still relies on btrfs, though, which we don't use on all machines, so it is not generally applicable to our current setup of installations
  3. Re-deploy the complete OSD infrastructure on a container management system, e.g. a Rancher-controlled Kubernetes cluster with our workloads running in containers

We discussed in the daily on 2021-12-21 which solution we are looking for and what other minor things we could improve, but we could not easily find a way forward that is properly aligned with the ticket. So I think it's easier to just call this "Resolved", as the root cause was properly identified and fixed by completing the upgrades on OSD workers with #99192.
