action #122458
Updated by okurz almost 2 years ago
## Observation

In the O3 web UI, rebel:5 initially showed the 'offline' status. After I ran `systemctl restart openqa-worker@5` on rebel, the worker status in the web UI changed to 'broken', as shown below.

```
rebel:5 rebel 64bit-ipmi_rebel,64bit-ipmi-large-mem_rebel,64bit-ipmi-amd_rebel,blackbauhinia_rebel x86_64 **Broken** 1 34
```

Here is the failure:

```
422138 Dec 26 08:52:57 rebel worker[24670]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382
422139 Dec 26 08:52:57 rebel worker[6598]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 10>
422140 Dec 26 08:52:57 rebel worker[24670]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382
422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked
422142 Dec 26 08:52:57 rebel worker[24670]: at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 757.
422143 Dec 26 08:52:57 rebel worker[24670]: OpenQA::Worker::_lock_pool_directory(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/o>
422144 Dec 26 08:52:57 rebel worker[24670]: eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 745
...
422161 Dec 26 08:52:57 rebel worker[24670]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/openqa/script/wor>
422162 Dec 26 08:52:57 rebel worker[24670]: - checking again for web UI 'http://openqa1-opensuse' in 100.00 s
```

Could you help fix the failure, or point me to how to fix it?

## Acceptance criteria

* **AC1:** The openQA worker instance rebel:5 passes openQA jobs

## Suggestions

* Log into the machine "rebel", which is part of the o3 infrastructure, and check the process table, the logs, and the files in the pool directory.
* Try a reboot of the machine and monitor openQA jobs on the instance.
* Look for crashes of isotovideo or the openQA worker and for any left-over lock files.
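
The log check in the suggestions could be partly scripted. The following is a minimal Python sketch, not part of openQA, that scans journal lines in the format quoted in the observation, collects the worker PIDs that appear, and reports which PID failed to lock which pool directory. The function name `summarize_worker_log`, the `SAMPLE` list, and both regular expressions are assumptions made for illustration; they are derived only from the log lines shown in this ticket.

```python
import re
from collections import defaultdict

# Hypothetical patterns, derived only from the journal lines quoted in this ticket.
LINE_RE = re.compile(r"worker\[(?P<pid>\d+)\]: (?P<msg>.*)$")
LOCK_RE = re.compile(r"Unable to lock pool directory: (?P<pool>\S+) already locked")

# Three lines copied from the ticket (pager line numbers dropped, truncation kept).
SAMPLE = [
    "Dec 26 08:52:57 rebel worker[24670]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382",
    "Dec 26 08:52:57 rebel worker[6598]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 10>",
    "Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked",
]


def summarize_worker_log(lines):
    """Return (pids, lock_failures).

    pids is the set of worker PIDs seen in the lines; lock_failures maps each
    pool directory to the set of PIDs that reported it as already locked.
    """
    pids = set()
    lock_failures = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a worker log line
        pid = int(match.group("pid"))
        pids.add(pid)
        lock = LOCK_RE.search(match.group("msg"))
        if lock:
            lock_failures[lock.group("pool")].add(pid)
    return pids, dict(lock_failures)


if __name__ == "__main__":
    pids, lock_failures = summarize_worker_log(SAMPLE)
    print("worker PIDs seen:", sorted(pids))
    for pool, failed in lock_failures.items():
        print(f"{pool}: lock failed for PID(s) {sorted(failed)}")
```

In the excerpt, two different worker PIDs log within the same second. This may indicate that an older worker process is still running and holding the lock; checking the process table on rebel would confirm or rule that out.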