Project

General

Profile

action #81228

many o3 workers not working on jobs as of 2020-12-21 (w4 seems to be ok)

Added by okurz 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2020-12-21
Due date:
% Done:

0%

Estimated time:

Description

Observation

As reported https://matrix.to/#/!ZzdoIxkFyLxnxddLqM:matrix.org/$16085395852091876OpEHv:matrix.org?via=matrix.org&via=opensuse.org&via=t2bot.io "DimStar
Martchus: okurz[m]: to me it seems something is wrong with 03; ow1 for example is not working on any job, despite there being a ton of scheduled ones left; only ow4 seems to pick up jobs".

History

#1 Updated by okurz 7 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz

No failed systemd services on openqaworker1.

Process tree looked like all services were running and also one job (but just one) was running:

_openqa+  3037  0.0  0.0 157876 67224 ?        Ss   03:34   0:06 /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  3309  0.0  0.0 151868 64348 ?        S    03:34   0:08  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  3310  0.0  0.0 152016 64608 ?        S    03:34   0:07  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  3311  0.0  0.0 152028 64664 ?        S    03:34   0:08  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  3312  0.0  0.0 151868 64348 ?        S    03:34   0:08  \_ /usr/bin/perl /usr/share/openqa/script/openqa-workercache prefork -m production -i 100 -H 400 -w 4 -G 80
_openqa+  3056  0.0  0.0 149680 67296 ?        Ss   03:34   0:21 /usr/bin/perl /usr/share/openqa/script/openqa-workercache run -m production --reset-locks
_openqa+  3061  0.0  0.0 141988 62000 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 3
_openqa+  3062  0.0  0.0 141992 62044 ?        Ss   03:34   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 14
_openqa+  3063  0.0  0.0 141988 61984 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 10
_openqa+  3064  0.0  0.0 141992 62196 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 16
_openqa+  3065  0.0  0.0 141992 62268 ?        Ss   03:34   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 15
_openqa+  3066  0.1  0.0 221152 137056 ?       Ss   03:34   0:26 /usr/bin/perl /usr/share/openqa/script/worker --instance 17
_openqa+ 12945  0.8  0.0 342376 229544 ?       S    09:46   0:04  \_ perl /usr/bin/isotovideo -d
_openqa+ 12950  0.0  0.0 184448 85404 ?        S    09:46   0:00      \_ /usr/bin/isotovideo: comman
_openqa+ 12953  1.1  0.0 3833448 208708 ?      Sl   09:46   0:06      \_ /usr/bin/isotovideo: autote
_openqa+ 13016 23.4  0.1 3864892 440316 ?      Sl   09:46   2:07      \_ /usr/bin/isotovideo: backen
_openqa+ 13079  5.6  0.0 1453272 59000 ?       SNl  09:46   0:30          \_ /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/17/video.ogv
_openqa+ 13123 78.0  0.4 4080072 1142068 ?     Sl   09:46   7:04          \_ /usr/bin/qemu-system-x86_64 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.driveA= -m 1024 -cpu qe
_openqa+  3067  0.0  0.0 141988 61912 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 7
_openqa+  3068  0.0  0.0 141992 61948 ?        Ss   03:34   0:03 /usr/bin/perl /usr/share/openqa/script/worker --instance 13
_openqa+  3069  0.0  0.0 141992 62116 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 18
_openqa+  3070  0.0  0.0 141988 62132 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 5
_openqa+  3071  0.0  0.0 142048 62124 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 19
_openqa+  3072  0.0  0.0 142040 61916 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 6
_openqa+  3073  0.0  0.0 141988 61828 ?        Ss   03:34   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 4
_openqa+  3074  0.0  0.0 142052 62116 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
_openqa+  3075  0.0  0.0 141988 62104 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 11
_openqa+  3076  0.0  0.0 142036 61956 ?        Ss   03:34   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 2
_openqa+  3077  0.0  0.0 141988 61844 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 12
_openqa+  3078  0.0  0.0 141988 61888 ?        Ss   03:34   0:02 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
_openqa+  3079  0.0  0.0 141992 62036 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 20
_openqa+  3080  0.0  0.0 142040 62128 ?        Ss   03:34   0:01 /usr/bin/perl /usr/share/openqa/script/worker --instance 9

Output of openqa-worker@1 journal:

-- Reboot --
Dec 21 03:34:31 openqaworker1 systemd[1]: Starting openQA Worker #1...
Dec 21 03:34:31 openqaworker1 systemd[1]: Started openQA Worker #1.
Dec 21 03:34:34 openqaworker1 worker[3074]: [2020-12-21T03:34:34.174 CET] [debug] Unable to serialize fatal error: Can't open file "base_state.json": Read-only file system at /usr/lib/os-autoinst/bmwqemu.pm line 86.
Dec 21 03:34:34 openqaworker1 worker[3074]: [info] worker 1:
Dec 21 03:34:34 openqaworker1 worker[3074]:  - config file:           /etc/openqa/workers.ini
Dec 21 03:34:34 openqaworker1 worker[3074]:  - worker hostname:       openqaworker1
Dec 21 03:34:34 openqaworker1 worker[3074]:  - isotovideo version:    20
Dec 21 03:34:34 openqaworker1 worker[3074]:  - websocket API version: 1
Dec 21 03:34:34 openqaworker1 worker[3074]:  - web UI hosts:          http://openqa1-opensuse
Dec 21 03:34:34 openqaworker1 worker[3074]:  - class:                 openqaworker1,qemu_x86_64,pool_is_hdd,heavyload,kubic_x86_64,tap
Dec 21 03:34:34 openqaworker1 worker[3074]:  - no cleanup:            no
Dec 21 03:34:34 openqaworker1 worker[3074]:  - pool directory:        /var/lib/openqa/pool/1
Dec 21 03:34:34 openqaworker1 worker[3074]: [error] Worker cache not available: Cache service info error: Connection refused
Dec 21 03:34:34 openqaworker1 worker[3074]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Dec 21 03:34:34 openqaworker1 worker[3074]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Dec 21 03:34:34 openqaworker1 worker[3074]: [info] Registering with openQA http://openqa1-opensuse
Dec 21 03:34:34 openqaworker1 worker[3074]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/51
Dec 21 03:34:34 openqaworker1 worker[3074]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 51
Dec 21 04:00:22 openqaworker1 worker[3074]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/51 finished by remote side with code 1006, no reason - trying again in 10 seconds
Dec 21 04:00:32 openqaworker1 worker[3074]: [info] Registering with openQA http://openqa1-opensuse
Dec 21 04:00:32 openqaworker1 worker[3074]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/51
Dec 21 04:00:33 openqaworker1 worker[3074]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 51

likely the error "[error] Worker cache not available: Cache service info error: Connection refused" is the problem here and no reconnection is done. I would have expected a different error message by now considering that https://github.com/os-autoinst/openQA/pull/3646 should be part of it already. Maybe that PR is even the culprit here?

I did systemctl restart openqa-worker.target on openqaworker1 and jobs seem to be picked up properly again.

#2 Updated by okurz 7 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

Also did systemctl restart openqa-worker.target on openqaworker7 now.

aarch64 showed that the systemd service openqa-worker-cacheservice-minion failed, journal showed:

Dec 21 09:17:46 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7884] [i] [#21655] Finished download
Dec 21 09:17:56 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7902] [i] [#21656] Downloading: "openSUSE-Leap-15.3-DVD-aarch64-Build37.2-Media.iso"
Dec 21 09:17:56 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7902] [i] [#21656] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
Dec 21 09:17:56 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7902] [i] [#21656] Downloading "openSUSE-Leap-15.3-DVD-aarch64-Build37.2-Media.iso" from "http://openqa1-opensuse/tests/1518124/asset/is>
Dec 21 09:17:56 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7902] [i] [#21656] Content of "/var/lib/openqa/cache/openqa1-opensuse/openSUSE-Leap-15.3-DVD-aarch64-Build37.2-Media.iso" has not change>
Dec 21 09:17:56 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7902] [i] [#21656] Finished download
Dec 21 09:18:06 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7925] [i] [#21657] Sync: "rsync://openqa1-opensuse/tests" to "/var/lib/openqa/cache/openqa1-opensuse"
Dec 21 09:18:06 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7925] [i] [#21657] Calling: rsync -avHP rsync://openqa1-opensuse/tests/ --delete /var/lib/openqa/cache/openqa1-opensuse/tests/
Dec 21 09:18:06 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: [7925] [i] [#21657] Finished sync: 0
Dec 21 09:34:41 openqa-aarch64 openqa-worker-cacheservice-minion[2911]: DBD::SQLite::st execute failed: unable to open database file at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 395.
Dec 21 09:34:41 openqa-aarch64 openqa-worker-cacheservice-minion[2911]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 395.
Dec 21 09:34:41 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 21 09:34:41 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Dec 21 09:34:41 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Dec 21 09:34:42 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Dec 21 09:34:42 openqa-aarch64 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Dec 21 09:34:42 openqa-aarch64 systemd[1]: Started OpenQA Worker Cache Service Minion.
Dec 21 09:34:43 openqa-aarch64 openqa-worker-cacheservice-minion[10955]: [10955] [i] [GzrfSZ1P] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
Dec 21 09:34:43 openqa-aarch64 openqa-worker-cacheservice-minion[10955]: [10955] [i] Resetting all leftover locks after restart
Dec 21 09:34:43 openqa-aarch64 openqa-worker-cacheservice-minion[10955]: DBD::SQLite::st execute failed: attempt to write a readonly database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line >
Dec 21 09:34:43 openqa-aarch64 openqa-worker-cacheservice-minion[10955]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 327.
Dec 21 09:34:43 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 21 09:34:43 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Dec 21 09:34:43 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Dec 21 09:34:43 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Dec 21 09:34:43 openqa-aarch64 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Dec 21 09:34:43 openqa-aarch64 systemd[1]: Started OpenQA Worker Cache Service Minion.
Dec 21 09:34:44 openqa-aarch64 openqa-worker-cacheservice-minion[10957]: [10957] [i] [emFYfBd9] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
Dec 21 09:34:44 openqa-aarch64 openqa-worker-cacheservice-minion[10957]: [10957] [i] Resetting all leftover locks after restart
Dec 21 09:34:44 openqa-aarch64 openqa-worker-cacheservice-minion[10957]: DBD::SQLite::st execute failed: attempt to write a readonly database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line >
Dec 21 09:34:44 openqa-aarch64 openqa-worker-cacheservice-minion[10957]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 327.
Dec 21 09:34:44 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 21 09:34:44 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Dec 21 09:34:44 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Dec 21 09:34:44 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Dec 21 09:34:44 openqa-aarch64 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Dec 21 09:34:44 openqa-aarch64 systemd[1]: Started OpenQA Worker Cache Service Minion.
Dec 21 09:34:45 openqa-aarch64 openqa-worker-cacheservice-minion[10959]: [10959] [i] [mpkpQMG9] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
Dec 21 09:34:45 openqa-aarch64 openqa-worker-cacheservice-minion[10959]: [10959] [i] Resetting all leftover locks after restart
Dec 21 09:34:45 openqa-aarch64 openqa-worker-cacheservice-minion[10959]: DBD::SQLite::st execute failed: attempt to write a readonly database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line >
Dec 21 09:34:45 openqa-aarch64 openqa-worker-cacheservice-minion[10959]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 327.
Dec 21 09:34:45 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 21 09:34:45 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Dec 21 09:34:45 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Dec 21 09:34:45 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Dec 21 09:34:45 openqa-aarch64 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Dec 21 09:34:45 openqa-aarch64 systemd[1]: Started OpenQA Worker Cache Service Minion.
Dec 21 09:34:46 openqa-aarch64 openqa-worker-cacheservice-minion[10961]: [10961] [i] [IzMRQ9y2] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
Dec 21 09:34:46 openqa-aarch64 openqa-worker-cacheservice-minion[10961]: [10961] [i] Resetting all leftover locks after restart
Dec 21 09:34:46 openqa-aarch64 openqa-worker-cacheservice-minion[10961]: DBD::SQLite::st execute failed: attempt to write a readonly database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line >
Dec 21 09:34:46 openqa-aarch64 openqa-worker-cacheservice-minion[10961]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 327.
Dec 21 09:34:46 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 21 09:34:46 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Dec 21 09:34:46 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Dec 21 09:34:47 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Dec 21 09:34:47 openqa-aarch64 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Dec 21 09:34:47 openqa-aarch64 systemd[1]: Started OpenQA Worker Cache Service Minion.
Dec 21 09:34:48 openqa-aarch64 openqa-worker-cacheservice-minion[10963]: [10963] [i] [Wh5aRaLU] Cache size of "/var/lib/openqa/cache" is 158GiB, with limit 160GiB
Dec 21 09:34:48 openqa-aarch64 openqa-worker-cacheservice-minion[10963]: [10963] [i] Resetting all leftover locks after restart
Dec 21 09:34:48 openqa-aarch64 openqa-worker-cacheservice-minion[10963]: DBD::SQLite::st execute failed: attempt to write a readonly database at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line >
Dec 21 09:34:48 openqa-aarch64 openqa-worker-cacheservice-minion[10963]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Backend/SQLite.pm line 327.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 21 09:34:48 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Service RestartSec=100ms expired, scheduling restart.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Start request repeated too quickly.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: Failed to start OpenQA Worker Cache Service Minion.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Unit entered failed state.
Dec 21 09:34:48 openqa-aarch64 systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.

restarting the service obviously reproduces the problem. Also

rm -rf /var/lib/openqa/cache/

fails due to "Read-only file system". This seems to first appear in the system journal at:

Dec 21 09:11:19 openqa-aarch64 kernel: sas: Enter sas_scsi_recover_host busy: 24 failed: 24
Dec 21 09:11:19 openqa-aarch64 kernel: sas: ata1: end_device-0:0:2: cmd error handler
Dec 21 09:11:19 openqa-aarch64 kernel: sas: ata1: end_device-0:0:2: dev error handler
Dec 21 09:11:19 openqa-aarch64 kernel: sd 0:0:0:0: [sda] tag#3430 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT cmd_age=32s
Dec 21 09:11:19 openqa-aarch64 kernel: sd 0:0:0:0: [sda] tag#3430 CDB: Write(10) 2a 00 15 d7 49 38 00 00 20 00
Dec 21 09:11:19 openqa-aarch64 kernel: blk_update_request: I/O error, dev sda, sector 366430520 op 0x1:(WRITE) flags 0x0 phys_seg 4 prio class 0

Likely /dev/sda3 for mount point /var/lib/openqa should have a new filesystem created. Triggered reboot. Want to see how it behaves after reboot. looks better after reboot. Rest for others.

#3 Updated by mkittler 7 months ago

I would have expected a different error message by now considering that https://github.com/os-autoinst/openQA/pull/3646 should be part of it already.

Not really. If the cache service is not available at the beginning it is still supposed to show this message. However, I would expect to see the new error message after registration. Unless the error has vanished until then. DimStar said that the workers were shown as online and not as broken which is in-line with not seeing the error. I'll do some local testing, maybe I can reproduce the situation.

#4 Updated by mkittler 7 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

And by the way, the reboot of aarch64 seems to have helped. However, I'm ignoring that case for now because it looks totally different than the problem on openqaworker1 and openqaworker7 and unrelated to recent code changes.


Note that we actually also see the log message from https://github.com/os-autoinst/openQA/pull/3646 on openqaworker7:

-- Reboot --
Dez 21 03:37:03 openqaworker7 systemd[1]: Starting openQA Worker #1...
Dez 21 03:37:03 openqaworker7 systemd[1]: Started openQA Worker #1.
Dez 21 03:37:10 openqaworker7 worker[3738]: [2020-12-21T03:37:10.182 CET] [debug] Unable to serialize fatal error: Can't open file "base_state.json": Permission denied at /usr/lib/os-autoinst/bmwqemu.pm line 86.
Dez 21 03:37:10 openqaworker7 worker[3738]: [info] worker 1:
Dez 21 03:37:10 openqaworker7 worker[3738]:  - config file:           /etc/openqa/workers.ini
Dez 21 03:37:10 openqaworker7 worker[3738]:  - worker hostname:       openqaworker7
Dez 21 03:37:10 openqaworker7 worker[3738]:  - isotovideo version:    20
Dez 21 03:37:10 openqaworker7 worker[3738]:  - websocket API version: 1
Dez 21 03:37:10 openqaworker7 worker[3738]:  - web UI hosts:          http://openqa1-opensuse
Dez 21 03:37:10 openqaworker7 worker[3738]:  - class:                 openqaworker7,qemu_x86_64
Dez 21 03:37:10 openqaworker7 worker[3738]:  - no cleanup:            no
Dez 21 03:37:10 openqaworker7 worker[3738]:  - pool directory:        /var/lib/openqa/pool/1
Dez 21 03:37:10 openqaworker7 worker[3738]: [error] Worker cache not available: Cache service info error: Connection refused
Dez 21 03:37:10 openqaworker7 worker[3738]: [info] CACHE: caching is enabled, setting up /var/lib/openqa/cache/openqa1-opensuse
Dez 21 03:37:10 openqaworker7 worker[3738]: [info] Project dir for host http://openqa1-opensuse is /var/lib/openqa/share
Dez 21 03:37:10 openqaworker7 worker[3738]: [info] Registering with openQA http://openqa1-opensuse
Dez 21 03:37:10 openqaworker7 worker[3738]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/193
Dez 21 03:37:10 openqaworker7 worker[3738]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 193
Dez 21 03:37:10 openqaworker7 worker[3738]: [warn] Worker cache not available: Cache service info error: Connection refused - checking again for web UI 'http://openqa1-opensuse' in 100.00 s
Dez 21 04:00:22 openqaworker7 worker[3738]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/193 finished by remote side with code 1006, no reason - trying again in 10 seconds
Dez 21 04:00:32 openqaworker7 worker[3738]: [info] Registering with openQA http://openqa1-opensuse
Dez 21 04:00:32 openqaworker7 worker[3738]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/193
Dez 21 04:00:33 openqaworker7 worker[3738]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 193
Dez 21 10:09:26 openqaworker7 worker[3738]: [info] Quit due to signal TERM

But the log message is only repeated one time. That means the code branch to re-attempt the cache service check is still executed (at least one time). Maybe the worker cache really didn't appear broken anymore. That would also explain why the worker showed up as online and not broken but still leaves the question why the worker didn't accept any jobs. Maybe the current error isn't cleared correctly.

#5 Updated by mkittler 7 months ago

It was really just that the "current error" is not cleared after a positive availability check during the status updates while the worker is idling. PR: https://github.com/os-autoinst/openQA/pull/3652

#6 Updated by okurz 7 months ago

  • Status changed from In Progress to Resolved

PR was merged and problems got resolved this way.

Also available in: Atom PDF