action #122458

Updated by okurz almost 2 years ago

## Observation 
 In the O3 web UI, rebel:5 initially showed the status 'offline'. I ran `systemctl restart openqa-worker@5` on rebel, after which the worker status in the web UI changed to 'broken', as shown below. 

 ``` 
 rebel:5      	 rebel      	 64bit-ipmi_rebel,64bit-ipmi-large-mem_rebel,64bit-ipmi-amd_rebel,blackbauhinia_rebel      	 x86_64      	 **Broken**        	 1      	 34 
 ``` 

 Here is the failure: 
 ``` 
  422138 Dec 26 08:52:57 rebel worker[24670]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382 
  422139 Dec 26 08:52:57 rebel worker[6598]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 10> 
  422140 Dec 26 08:52:57 rebel worker[24670]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382 
  422141 Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked 
  422142 Dec 26 08:52:57 rebel worker[24670]:    at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 757. 
  422143 Dec 26 08:52:57 rebel worker[24670]:           OpenQA::Worker::_lock_pool_directory(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/o> 
  422144 Dec 26 08:52:57 rebel worker[24670]:           eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 745 
 ... 
  422161 Dec 26 08:52:57 rebel worker[24670]:           OpenQA::Worker::exec(OpenQA::Worker=HASH(0x560fa90f2828)) called at /usr/share/openqa/script/wor> 
  422162 Dec 26 08:52:57 rebel worker[24670]:    - checking again for web UI 'http://openqa1-opensuse' in 100.00 s 
 ```         

 Could you help fix the failure, or point me to how I can fix it myself? 

 ## Acceptance criteria 
 * **AC1:** The openQA worker instance rebel:5 passes openQA jobs 

 ## Suggestions 
 * Log into the machine "rebel", which is part of the o3 infrastructure, and check the process table, the logs, and the files in the pool directory. Try rebooting the machine, then monitor openQA jobs on the instance. Look for crashes of isotovideo or the openQA worker and for any left-over lock files.
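
 As a starting point for the log check, here is a minimal sketch that groups journal lines by worker PID and collects the pool directories reported as locked. The function name `summarize_worker_log` and both regular expressions are hypothetical helpers written for this ticket, not part of openQA. The sample lines are copied from the excerpt above, including the truncated one. Note that the excerpt contains messages from two different worker PIDs within the same second; this might mean that an older worker process is still running and holding the lock, but that is only a guess that needs to be verified against the process table on rebel.

 ```python
import re
from collections import defaultdict

# Hypothetical patterns, derived only from the log excerpt in this ticket.
# Matches journal lines such as "... rebel worker[24670]: [warn] message".
LINE_RE = re.compile(r"worker\[(?P<pid>\d+)\]:\s*(?P<msg>.*)")
LOCK_RE = re.compile(r"Unable to lock pool directory: (?P<path>\S+) already locked")


def summarize_worker_log(lines):
    """Return (messages grouped by worker PID, set of pool dirs reported as locked)."""
    by_pid = defaultdict(list)
    locked = set()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a worker line, e.g. the "..." elision
        msg = match.group("msg")
        by_pid[int(match.group("pid"))].append(msg)
        lock = LOCK_RE.search(msg)
        if lock:
            locked.add(lock.group("path"))
    return dict(by_pid), locked


# Lines copied from the ticket; the second one is truncated in the original.
sample = [
    "Dec 26 08:52:57 rebel worker[24670]: [info] Establishing ws connection via ws://openqa1-opensuse/api/v1/ws/382",
    "Dec 26 08:52:57 rebel worker[6598]: [warn] Websocket connection to http://openqa1-opensuse/api/v1/ws/382 finished by remote side with code 10>",
    "Dec 26 08:52:57 rebel worker[24670]: [info] Registered and connected via websockets with openQA host http://openqa1-opensuse and worker ID 382",
    "Dec 26 08:52:57 rebel worker[24670]: [warn] Unable to lock pool directory: /var/lib/openqa/pool/5 already locked",
]

by_pid, locked = summarize_worker_log(sample)
print("worker PIDs seen:", sorted(by_pid))
print("pool directories reported as locked:", sorted(locked))
 ```

 If more than one PID shows up for the same worker instance, comparing those PIDs with the process table is a reasonable next step before removing any lock file by hand.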
