action #55883
closed
known_issue:"Can.t fcntl" [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling
Added by cfconrad over 5 years ago.
Updated about 4 years ago.
Category:
Bugs in existing tests
Target version:
QE Kernel - QE Kernel Done
Description
When https://github.com/os-autoinst/os-autoinst/pull/1182 was deployed, we encountered a lot of incomplete
jobs on o3.
There seems to be a correlation between save_memory_dump()
and qemu left overs, if this PR was included.
This ticked https://progress.opensuse.org/issues/55595 show some investigation on save_memory_dump()
in short, I think it was that xz
return with 2 on warnings. And we have use autodie :all
in qemu.pm which then just let this process die() and the job end's with ´incomplete`.
So I guess that qemu sometimes have problem in exact that "unexpected" end, to cleanup and kill all qemu instances.
The investigation documented in https://progress.opensuse.org/issues/55505 pinpoints to the PR#1182 and also mention a reproducer like:
openqa-clone-job --from https://openqa.opensuse.org/tests/1006973 CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse.git#fix/poo55505_migration_incompletes WORKER_CLASS=qemu_x86_64_tw
Unfortunately, I wasn't able to reproduce it till now...
- Related to action #54260: [kernel][s390x] Occasional failures due timeout added
- Category set to Bugs in existing tests
- Status changed from New to In Progress
- Status changed from In Progress to Feedback
PR was merged, but introduced a regression!!
From autoinst-log.txt
[2019-12-04T13:40:28.465 CET] [debug] <<< consoles::serial_screen::type_string(text='
', json_cmd_token='bjYlPJkG', cmd='backend_type_string')
[2019-12-04T13:40:28.466 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.
Issue was introduced with https://github.com/os-autoinst/os-autoinst/pull/1274. Looks like I forgot on rebase to update ssh_serial.pm which was introduced in https://github.com/os-autoinst/os-autoinst/pull/1176.
This PR https://github.com/os-autoinst/os-autoinst/pull/1313 should fix it.
If someone knows a better way to call super class constructor, plz let me know.
- Related to action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106. added
https://openqa.suse.de/tests/3678200 incompleted with
[2019-12-07T14:37:30.891 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
/usr/lib/os-autoinst/consoles/vnc_base.pm:62:{
'connect_timeout' => 3,
'port' => 5997,
'hostname' => 'localhost'
}
[2019-12-07T14:37:30.894 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console')
[2019-12-07T14:37:30.894 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.896 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147
https://github.com/os-autoinst/os-autoinst/pull/1313 was already deployed on the affected worker. can you take a look please?
- Subject changed from [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling to known_issue:"Can.t fcntl" [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling
Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147
I think it is related to https://github.com/os-autoinst/openQA/pull/2257 which should adopt the apparmor profile.
Going to double check the worker.
The workers /etc/apparmor.d/usr.share.openqa.script.worker looks ok. Need further investigation.
From my point of view, it cannot be related to apparmor. The same IOCTLs are used before on the second console:
[2019-12-07T14:37:30.891 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
But then they fail on /var/lib/openqa/pool/7/virtio_console
.
Maybe the worker was out of ram?
@okurz, can you tell me if this happen often? And if so, how can I track it.
IMHO missing RAM wouldn't return Operation not permitted (EPERM). It might not be apparmor, but I bet it's security related.
@pvorel I don't understand why this happen on one pipe but close after on other not and both pipes are created by the same process.
When you search in autoinst-log.txt for open_pipe
you will find the same command multiple times succeed without any problem.
So most of the time we see this:
[2019-12-07T14:28:03.397 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.398 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console')
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.399 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
And only once the error.
Maybe we should just not autodie on such error and continue or try with less size.
cfconrad wrote:
@okurz, can you tell me if this happen often? And if so, how can I track it.
https://gitlab.suse.de/openqa/auto-review/pipelines runs recurringly every day and labels incomplete jobs with this ticket whenever the issue is encountered. To find these issues you can query the database on osd, e.g. with
select job_id from comments where text ~ 'poo#55883';
in SQL which as of now only shows a single job, the one I reported about. Maybe there are no more but auto-review does not yet parse the "known_issue" from tickets, waiting for https://github.com/os-autoinst/scripts/pull/14
Thx for explanation.
I actually never took a look into os-autoinst/scripts :) nice!
- Status changed from Feedback to Resolved
- Target version changed from 445 to 457
- Target version changed from 457 to QE Kernel Done
Also available in: Atom
PDF