Project

General

Profile

Actions

action #55883

closed

known_issue:"Can.t fcntl" [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling

Added by cfconrad over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
QE Kernel - QE Kernel Done
Start date:
2019-08-23
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

When https://github.com/os-autoinst/os-autoinst/pull/1182 was deployed, we encountered a lot of incomplete jobs on o3.
There seems to be a correlation between save_memory_dump() and qemu left overs, if this PR was included.

This ticked https://progress.opensuse.org/issues/55595 show some investigation on save_memory_dump() in short, I think it was that xz return with 2 on warnings. And we have use autodie :all in qemu.pm which then just let this process die() and the job end's with ´incomplete`.

So I guess that qemu sometimes have problem in exact that "unexpected" end, to cleanup and kill all qemu instances.

The investigation documented in https://progress.opensuse.org/issues/55505 pinpoints to the PR#1182 and also mention a reproducer like:

openqa-clone-job --from https://openqa.opensuse.org/tests/1006973 CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse.git#fix/poo55505_migration_incompletes WORKER_CLASS=qemu_x86_64_tw

Unfortunately, I wasn't able to reproduce it till now...


Related issues 2 (0 open2 closed)

Related to openQA Tests - action #54260: [kernel][s390x] Occasional failures due timeoutRejectedcfconrad2019-07-15

Actions
Related to openQA Tests - action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.Resolvedcfconrad2019-12-05

Actions
Actions #1

Updated by cfconrad over 4 years ago

  • Related to action #54260: [kernel][s390x] Occasional failures due timeout added
Actions #2

Updated by okurz over 4 years ago

  • Category set to Bugs in existing tests
Actions #3

Updated by cfconrad over 4 years ago

  • Status changed from New to In Progress
Actions #4

Updated by cfconrad over 4 years ago

  • Status changed from In Progress to Feedback

Added some cleanup, but still not able to reproduce.
Reopened the PR

https://github.com/os-autoinst/os-autoinst/pull/1274

=> waiting for merge.

Actions #5

Updated by cfconrad over 4 years ago

PR was merged, but introduced a regression!!

From autoinst-log.txt

[2019-12-04T13:40:28.465 CET] [debug] <<< consoles::serial_screen::type_string(text='
', json_cmd_token='bjYlPJkG', cmd='backend_type_string')
[2019-12-04T13:40:28.466 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.

Issue was introduced with https://github.com/os-autoinst/os-autoinst/pull/1274. Looks like I forgot on rebase to update ssh_serial.pm which was introduced in https://github.com/os-autoinst/os-autoinst/pull/1176.

This PR https://github.com/os-autoinst/os-autoinst/pull/1313 should fix it.

If someone knows a better way to call super class constructor, plz let me know.

Actions #6

Updated by cfconrad over 4 years ago

  • Related to action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106. added
Actions #7

Updated by okurz over 4 years ago

https://openqa.suse.de/tests/3678200 incompleted with

[2019-12-07T14:37:30.891 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
/usr/lib/os-autoinst/consoles/vnc_base.pm:62:{
  'connect_timeout' => 3,
  'port' => 5997,
  'hostname' => 'localhost'
}
[2019-12-07T14:37:30.894 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console')
[2019-12-07T14:37:30.894 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.896 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147

https://github.com/os-autoinst/os-autoinst/pull/1313 was already deployed on the affected worker. can you take a look please?

Actions #8

Updated by okurz over 4 years ago

  • Subject changed from [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling to known_issue:"Can.t fcntl" [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling
Actions #9

Updated by cfconrad over 4 years ago

Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147

I think it is related to https://github.com/os-autoinst/openQA/pull/2257 which should adopt the apparmor profile.
Going to double check the worker.

Actions #10

Updated by cfconrad over 4 years ago

The workers /etc/apparmor.d/usr.share.openqa.script.worker looks ok. Need further investigation.

Actions #11

Updated by cfconrad over 4 years ago

From my point of view, it cannot be related to apparmor. The same IOCTLs are used before on the second console:

[2019-12-07T14:37:30.891 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576

But then they fail on /var/lib/openqa/pool/7/virtio_console.

Maybe the worker was out of ram?

Actions #12

Updated by cfconrad over 4 years ago

@okurz, can you tell me if this happen often? And if so, how can I track it.

Actions #13

Updated by pvorel over 4 years ago

IMHO missing RAM wouldn't return Operation not permitted (EPERM). It might not be apparmor, but I bet it's security related.

Actions #14

Updated by cfconrad over 4 years ago

@pvorel I don't understand why this happen on one pipe but close after on other not and both pipes are created by the same process.

When you search in autoinst-log.txt for open_pipe you will find the same command multiple times succeed without any problem.

So most of the time we see this:

[2019-12-07T14:28:03.397 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.398 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console')
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.399 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576

And only once the error.

Maybe we should just not autodie on such error and continue or try with less size.

Actions #15

Updated by okurz over 4 years ago

cfconrad wrote:

@okurz, can you tell me if this happen often? And if so, how can I track it.

https://gitlab.suse.de/openqa/auto-review/pipelines runs recurringly every day and labels incomplete jobs with this ticket whenever the issue is encountered. To find these issues you can query the database on osd, e.g. with

select job_id from comments where text ~ 'poo#55883';

in SQL which as of now only shows a single job, the one I reported about. Maybe there are no more but auto-review does not yet parse the "known_issue" from tickets, waiting for https://github.com/os-autoinst/scripts/pull/14

Actions #16

Updated by cfconrad over 4 years ago

Thx for explanation.
I actually never took a look into os-autoinst/scripts :) nice!

Actions #17

Updated by cfconrad about 4 years ago

  • Status changed from Feedback to Resolved
Actions #18

Updated by metan about 4 years ago

  • Target version changed from 445 to 457
Actions #19

Updated by pcervinka over 3 years ago

  • Target version changed from 457 to QE Kernel Done
Actions

Also available in: Atom PDF