Project

General

Profile

action #55883

known_issue:"Can.t fcntl" [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling

Added by cfconrad about 2 years ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
QE Kernel - QE Kernel Done
Start date:
2019-08-23
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

When https://github.com/os-autoinst/os-autoinst/pull/1182 was deployed, we encountered a lot of incomplete jobs on o3.
There seems to be a correlation between save_memory_dump() and qemu left overs, if this PR was included.

This ticked https://progress.opensuse.org/issues/55595 show some investigation on save_memory_dump() in short, I think it was that xz return with 2 on warnings. And we have use autodie :all in qemu.pm which then just let this process die() and the job end's with ┬┤incomplete`.

So I guess that qemu sometimes have problem in exact that "unexpected" end, to cleanup and kill all qemu instances.

The investigation documented in https://progress.opensuse.org/issues/55505 pinpoints to the PR#1182 and also mention a reproducer like:

openqa-clone-job --from https://openqa.opensuse.org/tests/1006973 CASEDIR=https://github.com/okurz/os-autoinst-distri-opensuse.git#fix/poo55505_migration_incompletes WORKER_CLASS=qemu_x86_64_tw

Unfortunately, I wasn't able to reproduce it till now...


Related issues

Related to openQA Tests - action #54260: [kernel][s390x] Occasional failures due timeoutRejected2019-07-15

Related to openQA Tests - action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.Resolved2019-12-05

History

#1 Updated by cfconrad about 2 years ago

  • Related to action #54260: [kernel][s390x] Occasional failures due timeout added

#2 Updated by okurz almost 2 years ago

  • Category set to Bugs in existing tests

#3 Updated by cfconrad almost 2 years ago

  • Status changed from New to In Progress

#4 Updated by cfconrad almost 2 years ago

  • Status changed from In Progress to Feedback

Added some cleanup, but still not able to reproduce.
Reopened the PR

https://github.com/os-autoinst/os-autoinst/pull/1274

=> waiting for merge.

#5 Updated by cfconrad almost 2 years ago

PR was merged, but introduced a regression!!

From autoinst-log.txt

[2019-12-04T13:40:28.465 CET] [debug] <<< consoles::serial_screen::type_string(text='
', json_cmd_token='bjYlPJkG', cmd='backend_type_string')
[2019-12-04T13:40:28.466 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106.

Issue was introduced with https://github.com/os-autoinst/os-autoinst/pull/1274. Looks like I forgot on rebase to update ssh_serial.pm which was introduced in https://github.com/os-autoinst/os-autoinst/pull/1176.

This PR https://github.com/os-autoinst/os-autoinst/pull/1313 should fix it.

If someone knows a better way to call super class constructor, plz let me know.

#6 Updated by cfconrad almost 2 years ago

  • Related to action #60725: [kernel]Can't use an undefined value as a symbol reference at /usr/lib/os-autoinst/consoles/serial_screen.pm line 106. added

#7 Updated by okurz almost 2 years ago

https://openqa.suse.de/tests/3678200 incompleted with

[2019-12-07T14:37:30.891 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
/usr/lib/os-autoinst/consoles/vnc_base.pm:62:{
  'connect_timeout' => 3,
  'port' => 5997,
  'hostname' => 'localhost'
}
[2019-12-07T14:37:30.894 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console')
[2019-12-07T14:37:30.894 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.896 CET] [debug] Backend process died, backend errors are reported below in the following lines:
Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147

https://github.com/os-autoinst/os-autoinst/pull/1313 was already deployed on the affected worker. can you take a look please?

#8 Updated by okurz almost 2 years ago

  • Subject changed from [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling to known_issue:"Can.t fcntl" [kernel] Investigate qemu left over, when switching from unix-domain sockets to named-pipes used for virtio_console handling

#9 Updated by cfconrad almost 2 years ago

Can't fcntl($fh, '1031', '1048576'): Operation not permitted at /usr/lib/os-autoinst/consoles/virtio_terminal.pm line 147

I think it is related to https://github.com/os-autoinst/openQA/pull/2257 which should adopt the apparmor profile.
Going to double check the worker.

#10 Updated by cfconrad almost 2 years ago

The workers /etc/apparmor.d/usr.share.openqa.script.worker looks ok. Need further investigation.

#11 Updated by cfconrad almost 2 years ago

From my point of view, it cannot be related to apparmor. The same IOCTLs are used before on the second console:

[2019-12-07T14:37:30.891 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:37:30.891 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576

But then they fail on /var/lib/openqa/pool/7/virtio_console.

Maybe the worker was out of ram?

#12 Updated by cfconrad almost 2 years ago

okurz, can you tell me if this happen often? And if so, how can I track it.

#13 Updated by pvorel almost 2 years ago

IMHO missing RAM wouldn't return Operation not permitted (EPERM). It might not be apparmor, but I bet it's security related.

#14 Updated by cfconrad almost 2 years ago

pvorel I don't understand why this happen on one pipe but close after on other not and both pipes are created by the same process.

When you search in autoinst-log.txt for open_pipe you will find the same command multiple times succeed without any problem.

So most of the time we see this:

[2019-12-07T14:28:03.397 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console1')
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.398 CET] [debug] <<< consoles::virtio_terminal::open_pipe(pipe_prefix='/var/lib/openqa/pool/7/virtio_console')
[2019-12-07T14:28:03.398 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576
[2019-12-07T14:28:03.399 CET] [info] ::: consoles::virtio_terminal::open_pipe: Set PIPE_SZ from 65536 to 1048576

And only once the error.

Maybe we should just not autodie on such error and continue or try with less size.

#15 Updated by okurz almost 2 years ago

cfconrad wrote:

okurz, can you tell me if this happen often? And if so, how can I track it.

https://gitlab.suse.de/openqa/auto-review/pipelines runs recurringly every day and labels incomplete jobs with this ticket whenever the issue is encountered. To find these issues you can query the database on osd, e.g. with

select job_id from comments where text ~ 'poo#55883';

in SQL which as of now only shows a single job, the one I reported about. Maybe there are no more but auto-review does not yet parse the "known_issue" from tickets, waiting for https://github.com/os-autoinst/scripts/pull/14

#16 Updated by cfconrad almost 2 years ago

Thx for explanation.
I actually never took a look into os-autoinst/scripts :) nice!

#17 Updated by cfconrad over 1 year ago

  • Status changed from Feedback to Resolved

#18 Updated by metan over 1 year ago

  • Target version changed from 445 to 457

#19 Updated by pcervinka 11 months ago

  • Target version changed from 457 to QE Kernel Done

Also available in: Atom PDF