Project

General

Profile

action #75232

coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues

error message when worker has no network (yet): Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86."

Added by okurz 8 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Feature requests
Target version:
Start date:
2020-10-24
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

Currently openqaworker8 has problems to bring the network up due to #73633 , all workers seem to gracefully handle the slow startup but there is an error (disguised as debug message), from journalctl -b -u openqa-worker@1:

-- Logs begin at Wed 2018-03-07 16:47:21 CET, end at Sat 2020-10-24 13:11:47 CEST. --
Oct 24 13:09:14 linux-fwcx systemd[1]: Starting openQA Worker #1...
Oct 24 13:09:15 linux-fwcx systemd[1]: Started openQA Worker #1.
Oct 24 13:09:16 linux-fwcx worker[3296]: [2020-10-24T13:09:16.359 CEST] [debug] Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86.
Oct 24 13:09:16 linux-fwcx worker[3296]: [info] [pid:3296] worker 1:
Oct 24 13:09:16 linux-fwcx worker[3296]:  - config file:           /etc/openqa/workers.ini
Oct 24 13:09:16 linux-fwcx worker[3296]:  - worker hostname:       linux-fwcx
Oct 24 13:09:16 linux-fwcx worker[3296]:  - isotovideo version:    20
Oct 24 13:09:16 linux-fwcx worker[3296]:  - websocket API version: 1
Oct 24 13:09:16 linux-fwcx worker[3296]:  - web UI hosts:          openqa.suse.de
Oct 24 13:09:16 linux-fwcx worker[3296]:  - class:                 caasp_x86_64,tap,qemu_x86_64,openqaworker8
Oct 24 13:09:16 linux-fwcx worker[3296]:  - no cleanup:            no
Oct 24 13:09:16 linux-fwcx worker[3296]:  - pool directory:        /var/lib/openqa/pool/1
Oct 24 13:09:16 linux-fwcx worker[3296]: [info] [pid:3296] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Oct 24 13:09:16 linux-fwcx worker[3296]: [info] [pid:3296] Project dir for host openqa.suse.de is /var/lib/openqa/share
Oct 24 13:09:16 linux-fwcx worker[3296]: [info] [pid:3296] Registering with openQA openqa.suse.de
Oct 24 13:09:16 linux-fwcx worker[3296]: [warn] [pid:3296] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Oct 24 13:09:26 openqaworker8 worker[3296]: [info] [pid:3296] Registering with openQA openqa.suse.de
Oct 24 13:09:26 openqaworker8 worker[3296]: [info] [pid:3296] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/1358
Oct 24 13:09:26 openqaworker8 worker[3296]: [info] [pid:3296] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 1358

see the message "Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86."

Acceptance criteria

  • AC1: No error message about the problem to open the file on startup, e.g. when there is no active network connection (yet)

Suggestions

Either prevent the error condition after analyzing the code or first try to reproduce the issue, e.g. in an environment with simulated broken network connection, e.g. using the tool "unshare" as we do in some of our tests or a clean container environment where this can be simulated. It might be that this has nothing to do with "network" but just startup of a worker.

Workaround

Error message can be ignored

History

#1 Updated by mkittler 8 months ago

  • Description updated (diff)

#2 Updated by okurz 7 months ago

  • Estimated time set to 39719.00 h

#3 Updated by okurz 7 months ago

  • Estimated time deleted (39719.00 h)

#4 Updated by okurz 6 months ago

  • Parent task set to #39719

#5 Updated by okurz 6 months ago

  • Description updated (diff)

#6 Updated by Xiaojing_liu 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to Xiaojing_liu

#7 Updated by Xiaojing_liu 5 months ago

I checked some workers log message, seems when workers start, there will be this debug message. For example, the worker in openqaworker5 start successfully and without any network issue, also has this message:

Dec 30 06:28:42 openqaworker5 worker[37495]: [2020-12-30T06:28:42.667 CET] [debug] Unable to serialize fatal error: Can't open file "base_state.json": Permis>
Dec 30 06:28:42 openqaworker5 worker[37495]: [info] [pid:37495] worker 2:
Dec 30 06:28:42 openqaworker5 worker[37495]:  - config file:           /etc/openqa/workers.ini
Dec 30 06:28:42 openqaworker5 worker[37495]:  - worker hostname:       openqaworker5
Dec 30 06:28:42 openqaworker5 worker[37495]:  - isotovideo version:    20
Dec 30 06:28:42 openqaworker5 worker[37495]:  - websocket API version: 1
Dec 30 06:28:42 openqaworker5 worker[37495]:  - web UI hosts:          openqa.suse.de
Dec 30 06:28:42 openqaworker5 worker[37495]:  - class:                 qemu_x86_64,qemu_x86_64_staging,tap,openqaworker5
Dec 30 06:28:42 openqaworker5 worker[37495]:  - no cleanup:            no
Dec 30 06:28:42 openqaworker5 worker[37495]:  - pool directory:        /var/lib/openqa/pool/2
Dec 30 06:28:42 openqaworker5 worker[37495]: [info] [pid:37495] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa.suse.de
Dec 30 06:28:42 openqaworker5 worker[37495]: [info] [pid:37495] Project dir for host openqa.suse.de is /var/lib/openqa/share
Dec 30 06:28:42 openqaworker5 worker[37495]: [info] [pid:37495] Registering with openQA openqa.suse.de
Dec 30 06:28:42 openqaworker5 worker[37495]: [info] [pid:37495] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/961
Dec 30 06:28:43 openqaworker5 worker[37495]: [info] [pid:37495] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 961
Dec 30 10:28:29 openqaworker5 worker[37495]: [debug] [pid:37495] Accepting job 5231041 from openqa.suse.de.
Dec 30 10:28:29 openqaworker5 worker[37495]: [debug] [pid:37495] Setting job 5231041 from openqa.suse.de up
Dec 30 10:28:29 openqaworker5 worker[37495]: [debug] [pid:37495] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Dec 30 10:28:29 openqaworker5 worker[37495]: [info] [pid:37495] +++ setup notes +++

This function is called by https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L433.
should the base_state.json be written in the working directory? such as /var/lib/openqa/pool/1. I am not sure if Mojo::File->new(STATE_FILE) will create the file in the working directory at that time.
And in my local environment, the information that should be written to base_state.json is:

Can\'t locate IO/Socket/Socks.pm in @INC (you may need to install the IO::Socket::Socks module) (@INC contains: /usr/lib/os-autoinst /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Client.pm line 19
  1. could we install IO::Socket::Socks? or defined $ENV{MOJO_NO_SOCKS}? so seems the Can't locate message will disappear. but if there is another error message, the debug message will happen again.
  2. could we simply remove this line https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L433 and the related code about parameter $fatal_error?

#8 Updated by cdywan 5 months ago

Xiaojing_liu wrote:

This function is called by https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L433. 
should the base_state.json be written in the working directory? such as `/var/lib/openqa/pool/1`. I am not sure if `Mojo::File->new(STATE_FILE)` will create the file in the working directory at that time.

This error Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86. suggests the file can't be written due to missing permissions. So either it's indeed in the wrong folder, or the right folder is not setup correctly. I would suggest the first step is to include the folder in the error message.

And in my local environment, the information that should be written to base_state.json is:

Can\'t locate IO/Socket/Socks.pm in @INC (you may need to install the IO::Socket::Socks module) (@INC contains: /usr/lib/os-autoinst /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Client.pm line 19
  1. could we install IO::Socket::Socks? or defined $ENV{MOJO_NO_SOCKS}? so seems the Can't locate message will disappear. but if there is another error message, the debug message will happen again.

How did you reproduce that? I guess Mojo might be using IO::Socket::Socks somewhere? Did you double-check that you have all of the dependencies we require installed?

  1. could we simply remove this line https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L433 and the related code about parameter $fatal_error?

That line ensures we save the failure to JSON, and if this fails it would failed everywhere we're trying to save state to disk. But see my remark regarding the directory above.

#9 Updated by Xiaojing_liu 5 months ago

cdywan wrote:

Xiaojing_liu wrote:

This function is called by https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L433. 
should the base_state.json be written in the working directory? such as `/var/lib/openqa/pool/1`. I am not sure if `Mojo::File->new(STATE_FILE)` will create the file in the working directory at that time.

This error Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86. suggests the file can't be written due to missing permissions. So either it's indeed in the wrong folder, or the right folder is not setup correctly. I would suggest the first step is to include the folder in the error message.

It's in the wrong folder. Because this message occurs when doing openQA::Worker->new, the command isotovideo --version will be run at that code. So the working directory is not /var/lib/openqa/pool/x.

And in my local environment, the information that should be written to base_state.json is:

Can\'t locate IO/Socket/Socks.pm in @INC (you may need to install the IO::Socket::Socks module) (@INC contains: /usr/lib/os-autoinst /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Client.pm line 19
  1. could we install IO::Socket::Socks? or defined $ENV{MOJO_NO_SOCKS}? so seems the Can't locate message will disappear. but if there is another error message, the debug message will happen again.

How did you reproduce that? I guess Mojo might be using IO::Socket::Socks somewhere? Did you double-check that you have all of the dependencies we require installed?

Just start openqa-worker service, and check the log message. it can be reproduced 100%. I don't find IO::Socket::Socks in openQA/dependencies.yaml. I guess we don't require it.

  1. could we simply remove this line https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L433 and the related code about parameter $fatal_error?

That line ensures we save the failure to JSON, and if this fails it would failed everywhere we're trying to save state to disk. But see my remark regarding the directory above.

I am not sure if it would failed everywhere, but it will not fail if the working directory is the correct directory.

#10 Updated by cdywan 5 months ago

Xiaojing_liu wrote:

cdywan wrote:

Xiaojing_liu wrote:

This function is called by https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L433. 
should the base_state.json be written in the working directory? such as `/var/lib/openqa/pool/1`. I am not sure if `Mojo::File->new(STATE_FILE)` will create the file in the working directory at that time.

This error Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86. suggests the file can't be written due to missing permissions. So either it's indeed in the wrong folder, or the right folder is not setup correctly. I would suggest the first step is to include the folder in the error message.

It's in the wrong folder. Because this message occurs when doing openQA::Worker->new, the command isotovideo --version will be run at that code. So the working directory is not /var/lib/openqa/pool/x.

This made me stop and think, why is this even relevant here? We're not executing a test, just getting the version number. We don't have any state to write.

https://github.com/os-autoinst/os-autoinst/pull/1611

#11 Updated by tinita 5 months ago

Just a note about the Can't locate IO/Socket/Socks.pm error message: this appears because of $SIG{__DIE__} = sub { $fatal_error = shift }; in https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L65
Normally it would be ignored, see https://metacpan.org/source/SRI/Mojolicious-8.70/lib%2FMojo%2FIOLoop%2FClient.pm#L19
So IO::Socket::Socks isn't required, but because of the global SIGDIE handler $fatal_error gets set.

#12 Updated by cdywan 5 months ago

tinita wrote:

Just a note about the Can't locate IO/Socket/Socks.pm error message: this appears because of $SIG{__DIE__} = sub { $fatal_error = shift }; in https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L65
Normally it would be ignored, see https://metacpan.org/source/SRI/Mojolicious-8.70/lib%2FMojo%2FIOLoop%2FClient.pm#L19
So IO::Socket::Socks isn't required, but because of the global SIGDIE handler $fatal_error gets set.

That actually makes sense. My fix was more of a work-around in hindsight. I tested it out, and the file isn't written anymore after moving the handler down!

#13 Updated by kraih 5 months ago

tinita wrote:

Just a note about the Can't locate IO/Socket/Socks.pm error message: this appears because of $SIG{__DIE__} = sub { $fatal_error = shift }; in https://github.com/os-autoinst/os-autoinst/blob/master/isotovideo#L65
Normally it would be ignored, see https://metacpan.org/source/SRI/Mojolicious-8.70/lib%2FMojo%2FIOLoop%2FClient.pm#L19
So IO::Socket::Socks isn't required, but because of the global SIGDIE handler $fatal_error gets set.

Ouch, that global $SIG{__DIE__} should really be fixed. So many ways that can backfire.

#14 Updated by Xiaojing_liu 5 months ago

  • Assignee changed from Xiaojing_liu to cdywan

I think cdywan pull request (https://github.com/os-autoinst/os-autoinst/pull/1611) can fix this issue, so assign this ticket to him.

#15 Updated by tinita 5 months ago

The SIGDIE handler was introduced as part of this issue: https://progress.opensuse.org/issues/63718#note-15

#16 Updated by cdywan 5 months ago

  • Status changed from In Progress to Feedback

#17 Updated by cdywan 5 months ago

  • Status changed from Feedback to Resolved

I double-checked with Jane, this should be fine now. The signal handler was moved, so the version call has no side-effects and we have a unit test checking that now

Also available in: Atom PDF