action #132167

Updated by okurz 11 months ago

## Observation 

 This job failed in the uploading phase 
 [2023-06-29T18:11:42.386959+02:00] [error] Failed uploading asset 
 [2023-06-29T18:11:42.387207+02:00] [error] Error uploading opensuse-Tumbleweed-x86_64@64bit-2G-:TW.21297.qcow2: 502 response: Bad Gateway 
 At that time the webui was apparently stopped and restarted: 
 Jun 29 16:11:42 ariel systemd[1]: Stopping The openQA web UI... 
 Jun 29 16:11:42 ariel systemd[1]: openqa-webui.service: Deactivated successfully. 
 Jun 29 16:11:42 ariel systemd[1]: Stopped The openQA web UI. 
 Jun 29 16:13:12 ariel systemd[1]: Started The openQA web UI. 
 I would expect nginx to return a 503 in that case, and the asset uploading should retry. 

 From /var/log/zypp/history we can see that the service was restarted during package upgrade: 

 2023-06-29 16:11:33|command|root@ariel|'zypper' '-n' '--no-refresh' 'dup' '--auto-agree-with-licenses' '--download-in-advance'| 
 2023-06-29 16:11:34|install|openQA-common|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|1791b2e78db52c05fd9adc1ace2ac7b4451ec328d1558141796d90b8ec812c52| 
 2023-06-29 16:11:34|install|openQA-munin|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|b35e7151cd7b3b7f4224ae5244763267d29a20a48b5ea64a85ea397b13219bd6| 
 2023-06-29 16:11:36|install|openQA-continuous-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|910b7875523e37ae0db7c76cab74fe9d4113fcd2c07fe4a4578e2fd7a67c2d21| 
 2023-06-29 16:11:37|install|openQA-client|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|490b772224e5a64f354db398199ab23789fa7307d5edd8c561381a7dfe107e60| 
 2023-06-29 16:11:38|install|openQA-auto-update|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|6699637a855eccc6d5420c0c1129623da9c33096b25130ec9914963c9a526065| 
 2023-06-29 16:13:13|install|openQA|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|d137cfea77fc8567cfd7f9d5a5ccb372a54670da7e33ce5a3de293d53bb3c7f4| 
 2023-06-29 16:13:14|install|openQA-local-db|4.6.1688038171.b33d284-lp154.5926.1|x86_64||devel_openQA|352062e0def659d7b349084300465b0161e5083b5a167a41ca49b312eb6cca63| 

 But 1.5m is quite long. What caused that? 

 From /var/log/openqa.1.xz: 

 [2023-06-29T16:11:42.554141Z] [info] [pid:890] Worker 661 websocket connection closed - 1006 
 [2023-06-29T16:11:42.554307Z] [info] [pid:890] Worker 642 websocket connection closed - 1006 
 [2023-06-29T16:11:42.554567Z] [info] [pid:890] Worker 505 websocket connection closed - 1006 
 [2023-06-29T16:11:45.099867Z] [info] Listening at "" <-- liveview handler 
 [2023-06-29T16:11:45.100547Z] [info] Listening at "http://[::1]:9528" 
 (90s delay) 
 [2023-06-29T16:13:14.544271Z] [info] Listening at "" <-- websocket 
 [2023-06-29T16:13:14.544929Z] [info] Listening at "http://[::1]:9527" 
 [2023-06-29T16:13:14.918142Z] [info] Loading external plugin AMQP 
 [2023-06-29T16:13:14.973300Z] [info] Loading external plugin ObsRsync 
 [2023-06-29T16:13:15.527287Z] [info] Listening at "" <-- webUI 
 [2023-06-29T16:13:15.527842Z] [info] Listening at "http://[::1]:9526" 
 [2023-06-29T16:13:15.527939Z] [info] Manager 8148 started 
 [2023-06-29T16:13:15.537953Z] [info] Worker 8301 started 

 From /var/log/openqa_scheduler 

 [2023-06-29T16:11:42.131776Z] [debug] [pid:889] Scheduler took 0.77747s to perform operations and allocated 0 jobs 
 [2023-06-29T16:11:42.131955Z] [debug] [pid:889] [rescheduling] Current tick is at 0 ms. New tick will be in: 20000 ms 
 (90s delay) 
 [2023-06-29T16:13:14.675640Z] [info] Listening at "" 
 [2023-06-29T16:13:14.676281Z] [info] Listening at "http://[::1]:9529" 

 Where does the 90s delay come from? 

 Maybe we need to bump the timeouts within nginx to be a bit longer what we had in apache? 

 ## Acceptance criteria 
 * **AC1:** Automatic package upgrades of openQA cause no confusing error reports to users 
 * **AC2:** Automatic package upgrades of openQA do not cause incomplete openQA jobs 

 ## Suggestions 
 * Apache returns 503 if the openQA webUI, why does nginx return 502? 
 * Ensure that openQA jobs do not incomplete in upload phase when the webUI is absent for some seconds 
 * Investigate what took the webUI 1.5m to restart