Project

General

Profile

action #57620

Updated by okurz over 4 years ago

## Motivation 
 After the weekly deployment of OSD I found during the restart of the openQA services that many jobs continued running however jobs that were in state "uploading" were aborted as incomplete, e.g. 
 https://openqa.suse.de/tests/3425424 
 I assume that before "worker refactoring in 2019" that behaviour was more resilient and less or no jobs would incomplete in this situation. In any case jobs during upload should retry uploading and the uploads should be accepted over an web UI (or websockets) restart and outage of some seconds. 

 ## Further details 

 ### logs log files 

 autoinst-log.txt and worker-log.txt of the job are attached. 

 Journal entries on OSD from the restart time window: 

 ``` 
 Oct 02 11:09:11 openqa systemd[1]: Stopping Handler for live view in openQA's web UI... 
 Oct 02 11:09:11 openqa openqa-livehandler[9119]: Server available at http://127.0.0.1:9528 
 Oct 02 11:09:11 openqa openqa-livehandler[9119]: Server available at http://[::1]:9528 
 Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA web UI... 
 Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA daemon for various background tasks like cleanup and saving needles... 
 Oct 02 11:09:11 openqa systemd[1]: Stopped Enqueues an audit event cleanup task for the openQA database every day.. 
 Oct 02 11:09:11 openqa systemd[1]: Stopping Enqueues an audit event cleanup task for the openQA database every day.. 
 Oct 02 11:09:11 openqa systemd[1]: Started Enqueues an audit event cleanup task for the openQA database every day.. 
 Oct 02 11:09:11 openqa systemd[1]: Stopped Handler for live view in openQA's web UI. 
 Oct 02 11:09:11 openqa systemd[1]: Started Handler for live view in openQA's web UI. 
 Oct 02 11:09:11 openqa systemd[1]: Stopped The openQA web UI. 
 Oct 02 11:09:11 openqa systemd[1]: Stopping The openQA WebSockets server... 
 Oct 02 11:09:13 openqa systemd[1]: Stopped The openQA WebSockets server. 
 Oct 02 11:09:13 openqa systemd[1]: Stopping The openQA Scheduler... 
 Oct 02 11:09:13 openqa openqa-scheduler[27611]: Warning: unable to close filehandle GEN5 properly: No space left on device during global destruction. 
 Oct 02 11:09:13 openqa systemd[1]: Stopped The openQA Scheduler. 
 Oct 02 11:09:13 openqa systemd[1]: Started The openQA Scheduler. 
 Oct 02 11:09:13 openqa systemd[1]: Started The openQA WebSockets server. 
 Oct 02 11:09:13 openqa systemd[1]: Started The openQA web UI. 
 Oct 02 11:09:15 openqa openqa-livehandler[17859]: BEGIN WORK 
 Oct 02 11:09:15 openqa openqa-livehandler[17859]: COMMIT 
 Oct 02 11:09:15 openqa systemd[1]: Stopped The openQA daemon for various background tasks like cleanup and saving needles. 
 Oct 02 11:09:15 openqa systemd[1]: Started The openQA daemon for various background tasks like cleanup and saving needles. 
 Oct 02 11:09:16 openqa openqa[17862]: BEGIN WORK 
 Oct 02 11:09:16 openqa openqa[17862]: COMMIT 
 Oct 02 11:09:17 openqa openqa[17862]: Server available at http://127.0.0.1:9526 
 Oct 02 11:09:17 openqa openqa[17862]: Server available at http://[::1]:9526 
 ``` 



 ### chat log 
 This was also discussed in https://chat.suse.de/group/openqa-dev?msg=ckosYAkuaob5coXCE , from there: 

 ``` 
 mkittler:The worker log shows multiple problems: 

 * The job is stopped during the upload phase. Specifically, while still uploading results for the first module isotovideo has already been exited. That makes the log look strange on the first look but it actually shouldn't be a problem. 
 * OSD was not reachable for some time during the upload (proxy error / service unavailable) [okurz: this was during the restart of services from osd deployment]. However, it was only unreachable for less than a minutes so the worker didn't give up. 
 * In the end of the upload OSD was unreachable again also affecting the WS connection. 
 * It is strange that the job is set to done in the middle of the upload: 

 ``` 
 Okt 02 09:57:26 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-6.txt 
 Okt 02 09:57:26 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-7.icmp4-uni-basic05.txt 
 Okt 02 09:57:26 openqaworker8 worker[18624]: [info] [pid:9219] Uploading serial_terminal.txt 
 Okt 02 09:57:27 openqaworker8 worker[18624]: [info] [pid:9219] Uploading worker-log.txt 
 Okt 02 09:57:28 openqaworker8 worker[18624]: [debug] [pid:18624] Setting job 3425424 to done 
 Okt 02 09:57:28 openqaworker8 worker[18624]: [debug] [pid:18624] REST-API call: POST http://openqa.suse.de/api/v1/jobs/3425424/status 
 Okt 02 09:57:29 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact LTP_net_stress.ipsec_icmp_icmp4-uni-basic05.txt 
 Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:8935] Uploading artefact icmp4-uni-basic05-8.txt 
 Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:9395] Optimizing /var/lib/openqa/pool/8/testresults/shutdown_ltp-57.png 
 Okt 02 09:57:30 openqaworker8 worker[18624]: [debug] [pid:9395] Uploading artefact shutdown_ltp-57.png as 88ab69ccbd5248aabe46d7afcef7c3f0 
 Okt 02 09:57:32 openqaworker8 worker[18624]: [debug] [pid:9395] Optimizing /var/lib/openqa/pool/8/testresults/.thumbs/shutdown_ltp-57.png 
 ``` 

 So there were upload problems caused by OSD being unavailable and maybe there's also a logic error in the worker code because it shouldn't set the job to done in the middle of the upload. That maybe has not worked anyways and therefore the job has been incompleted by the web UI in the end. 
 okurz: well, the web UI was restarted due to upgrade, many jobs that were "running" were still continuing. So I guess the problem is "job is incomplete if websockets server (or webui?) is unreachable for a minute, e.g. during upgrade". reopen a ticket or will you create a new one? 
 mkittler: If the web UI is offline long enough the job is actually expected to be incomplete. So that's actually expected. I'm more wondering about the worker itself because setting the job to done in the middle of the upload seems wrong. But if I don't look in the journal log but just at https://openqa.suse.de/tests/3425424/file/worker-log.txt there are lines missing. E.g. the REST call to set the job to done is not logged here. Maybe it didn't happen after all? 
 "long enough" is defined by the worker's patience for the web UI before stopping a job. It is also defined by the the web UIs patience to wait for a worker which doesn't respond. The worker has a retry limit for REST calls and waits a certain amount of time between them. If you look at the code you will find the exact numbers. For the web UI there's a variable defining a threshold. I guess it was 2 minutes. 
 ```

Back