action #92770
Updated by okurz over 3 years ago
## Observation
https://openqa.opensuse.org is not reachable: a browser gets no response. I can log in over ssh, but `curl http://localhost` also does not return in time.
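To make the "does not return in time" check repeatable, a minimal sketch with an explicit time limit could look like this. The function name `probe_webui` and the 10 second default are my assumptions, not anything configured on o3.

```shell
# Report whether a URL answers within a time limit.
# Arguments: URL (default http://localhost), limit in seconds (default 10).
# The name and both defaults are hypothetical, chosen for illustration.
probe_webui() {
    url="${1:-http://localhost}"
    limit="${2:-10}"
    status=0
    # --max-time bounds the whole transfer; --write-out prints only the HTTP code.
    code="$(curl --silent --output /dev/null --max-time "$limit" \
                 --write-out '%{http_code}' "$url")" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "$url responded with HTTP $code"
    else
        # curl exit status 28 means the time limit was hit.
        echo "no usable response from $url (curl exit status $status)"
        return 1
    fi
}
```

Calling `probe_webui http://localhost 5` on the host would then distinguish a timeout (exit status 28) from, for example, a refused connection.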
`systemctl status` shows no failed services.
`systemctl status openqa-webui` shows:
```
May 18 03:08:44 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 03:08:44 ariel openqa-webui-daemon[1992]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 03:14:38 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 03:14:38 ariel openqa-webui-daemon[1992]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 03:29:19 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 03:29:19 ariel openqa-webui-daemon[1992]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 04:47:00 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 04:47:00 ariel openqa-webui-daemon[1992]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 04:57:38 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 04:57:38 ariel openqa-webui-daemon[1992]: at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129
```
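To see how often publishing fails and for which events, the log lines can be tallied. This is only a sketch: the helper name `count_publish_failures` is made up here, and it assumes the message format stays exactly as in the excerpt above.

```shell
# Count "Publishing <event> failed" messages per event name.
# Reads journal lines on stdin, prints "<count> <event>" per event.
# The function name is hypothetical; the pattern matches the excerpt above.
count_publish_failures() {
    awk '
        match($0, /Publishing [^ ]+ failed/) {
            # The matched text is "Publishing <event> failed"; keep <event>.
            split(substr($0, RSTART, RLENGTH), words, " ")
            seen[words[2]]++
        }
        END { for (event in seen) print seen[event], event }
    '
}
```

On the host this could be fed with `journalctl -u openqa-webui --since today | count_publish_failures`.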
```
# ps auxf | grep '\<D\>'
…
geekote+ 10155 30.2 1.0 351884 178828 ? D 05:23 0:16 \_ /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -c 1 -G 800
```
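Grepping for a standalone `D` also matches any line that merely contains that letter as a word. A slightly stricter sketch filters on the STAT column itself. The helper name `filter_dstate` is an assumption, and the input is expected in the column order `pid,stat,wchan,args`.

```shell
# Keep the header line and every process whose STAT column starts with "D"
# (uninterruptible sleep). Expects input with columns: PID STAT WCHAN COMMAND.
# The function name is hypothetical.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on a live system with procps ps:
#   ps -eo pid,stat,wchan:32,args | filter_dstate
```

The extra `wchan` column shows the kernel function the process is sleeping in, which can hint at what the blocked process is waiting for.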
A restart with `systemctl restart openqa-webui` seemed to go through fine but brought no improvement.
`journalctl -f` shows:
```
-- Logs begin at Wed 2021-05-05 09:40:21 UTC. --
May 18 05:27:43 ariel nrpe[11937]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:43 ariel nrpe[11937]: Could not read request from client , bailing out...
May 18 05:27:43 ariel nrpe[11937]: INFO: SSL Socket Shutdown.
May 18 05:27:43 ariel nrpe[11947]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:43 ariel nrpe[11947]: Could not read request from client , bailing out...
May 18 05:27:43 ariel nrpe[11947]: INFO: SSL Socket Shutdown.
May 18 05:27:45 ariel dnsmasq-dhcp[1735]: DHCPDISCOVER(eth1) 00:25:90:83:f8:70 no address available
May 18 05:27:46 ariel nrpe[11959]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:46 ariel nrpe[11959]: Could not read request from client , bailing out...
May 18 05:27:46 ariel nrpe[11959]: INFO: SSL Socket Shutdown.
May 18 05:27:55 ariel nrpe[11972]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:55 ariel nrpe[11972]: Could not read request from client , bailing out...
May 18 05:27:55 ariel nrpe[11972]: INFO: SSL Socket Shutdown.
May 18 05:27:55 ariel nrpe[11973]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:55 ariel nrpe[11973]: Could not read request from client , bailing out...
May 18 05:27:55 ariel nrpe[11973]: INFO: SSL Socket Shutdown.
May 18 05:27:57 ariel nrpe[11987]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:57 ariel nrpe[11987]: Could not read request from client , bailing out...
May 18 05:27:57 ariel nrpe[11987]: INFO: SSL Socket Shutdown.
May 18 05:27:59 ariel nrpe[11994]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:59 ariel nrpe[11994]: Could not read request from client , bailing out...
May 18 05:27:59 ariel nrpe[11994]: INFO: SSL Socket Shutdown.
```
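The excerpt is dominated by nrpe messages, which makes other entries easy to miss. A rough tally per program shows what is producing the volume. This is a sketch assuming the default short journal format shown above; the name `tally_by_program` is hypothetical.

```shell
# Count journal lines per program name, most frequent first.
# Assumes the short format: "<month> <day> <time> <host> <program>[<pid>]: ...".
# The function name is hypothetical.
tally_by_program() {
    awk '
        /^--/ { next }                      # skip the "-- Logs begin ..." banner
        NF >= 5 {
            prog = $5
            sub(/\[[0-9]+\]:?$/, "", prog)  # strip "[pid]:" suffix
            sub(/:$/, "", prog)             # strip a bare trailing colon
            count[prog]++
        }
        END { for (prog in count) print count[prog], prog }
    ' | sort -rn
}
```

For example, `journalctl --since "10 minutes ago" | tally_by_program` would list the noisiest programs first.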
## Rollback after root cause resolved
* enable job_done hooks again in /etc/openqa/openqa.ini on o3
* start openqa-scheduler
* ensure all incomplete jobs are handled
* run auto-review manually, e.g. in https://gitlab.suse.de/openqa/auto-review/pipelines
* crosscheck incomplete and failed jobs manually
* start additional worker instances again, e.g. by rebooting the hosts: `for i in aarch64 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "reboot" ; done`
* set status on https://status.opensuse.org/dashboard back to Operational with result
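The worker reboot step from the list above can be sketched with a dry-run default, so that nothing is rebooted by accident. The `DRY_RUN` variable and the function name `reboot_workers` are my assumptions; the host list is the one from the rollback step.

```shell
# Hosts as listed in the rollback step.
WORKERS="aarch64 openqaworker4 openqaworker7 power8 imagetester rebel"

# Reboot every worker host, or only print the commands when DRY_RUN is 1.
# DRY_RUN (default 1) and the function name are hypothetical.
reboot_workers() {
    for host in $WORKERS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: ssh root@$host reboot"
        else
            echo "rebooting $host"
            # -n keeps ssh from reading stdin; continue if a host is unreachable.
            ssh -n "root@$host" reboot || echo "failed to reach $host" >&2
        fi
    done
}
```

Running `reboot_workers` only prints the planned commands; `DRY_RUN=0 reboot_workers` actually reboots the hosts.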