action #92770

openqa.opensuse.org down, o3 VM reachable, no failed service

Added by okurz 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2021-05-18
Due date:
2021-06-02
% Done:

0%

Estimated time:

Description

Observation

https://openqa.opensuse.org is not reachable, no response within a browser. I can log in over ssh, but curl http://localhost also does not return in time.

systemctl status shows no failed services

systemctl status openqa-webui shows

May 18 03:08:44 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 03:08:44 ariel openqa-webui-daemon[1992]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 03:14:38 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 03:14:38 ariel openqa-webui-daemon[1992]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 03:29:19 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 03:29:19 ariel openqa-webui-daemon[1992]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 04:47:00 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 04:47:00 ariel openqa-webui-daemon[1992]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129.
May 18 04:57:38 ariel openqa-webui-daemon[1992]: Unhandled rejected promise: Publishing opensuse.openqa.job.done failed at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm line 86.
May 18 04:57:38 ariel openqa-webui-daemon[1992]:  at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 129
# ps auxf | grep '\<D\>'
…
geekote+ 10155 30.2  1.0 351884 178828 ?       D    05:23   0:16  \_ /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -c 1 -G 800
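
Note that grep '\<D\>' can also match a stray "D" token elsewhere in the line (e.g. inside a command string); filtering on the STAT column itself is more precise. A minimal sketch, with sample lines standing in for live ps output:

```shell
# Filter a ps listing for processes in uninterruptible sleep ("D" state)
# by checking the STAT column directly. The sample data below stands in
# for real "ps aux" output like the one captured above.
ps_sample='USER PID STAT COMMAND
geekotest 10155 D /usr/bin/perl /usr/share/openqa/script/openqa prefork
root 1234 Ss /usr/sbin/sshd'
printf '%s\n' "$ps_sample" | awk 'NR > 1 && substr($3, 1, 1) == "D"'
```

On a live system the equivalent would be something like ps -eo user,pid,stat,args | awk 'NR > 1 && $3 ~ /^D/'.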

A restart with systemctl restart openqa-webui completed without errors but brought no improvement.

journalctl -f shows

-- Logs begin at Wed 2021-05-05 09:40:21 UTC. --
May 18 05:27:43 ariel nrpe[11937]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:43 ariel nrpe[11937]: Could not read request from client , bailing out...
May 18 05:27:43 ariel nrpe[11937]: INFO: SSL Socket Shutdown.
May 18 05:27:43 ariel nrpe[11947]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:43 ariel nrpe[11947]: Could not read request from client , bailing out...
May 18 05:27:43 ariel nrpe[11947]: INFO: SSL Socket Shutdown.
May 18 05:27:45 ariel dnsmasq-dhcp[1735]: DHCPDISCOVER(eth1) 00:25:90:83:f8:70 no address available
May 18 05:27:46 ariel nrpe[11959]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:46 ariel nrpe[11959]: Could not read request from client , bailing out...
May 18 05:27:46 ariel nrpe[11959]: INFO: SSL Socket Shutdown.
May 18 05:27:55 ariel nrpe[11972]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:55 ariel nrpe[11972]: Could not read request from client , bailing out...
May 18 05:27:55 ariel nrpe[11972]: INFO: SSL Socket Shutdown.
May 18 05:27:55 ariel nrpe[11973]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:55 ariel nrpe[11973]: Could not read request from client , bailing out...
May 18 05:27:55 ariel nrpe[11973]: INFO: SSL Socket Shutdown.
May 18 05:27:57 ariel nrpe[11987]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:57 ariel nrpe[11987]: Could not read request from client , bailing out...
May 18 05:27:57 ariel nrpe[11987]: INFO: SSL Socket Shutdown.
May 18 05:27:59 ariel nrpe[11994]: Error: (use_ssl == true): Request packet version was invalid!
May 18 05:27:59 ariel nrpe[11994]: Could not read request from client , bailing out...
May 18 05:27:59 ariel nrpe[11994]: INFO: SSL Socket Shutdown.

Rollback after root cause resolved

  • DONE: enable /usr/share/openqa/templates/webapi/main/group_overview.html.ep again
  • DONE: enable /usr/share/openqa/templates/webapi/test/overview.html.ep again
  • DONE: enable amqp again in o3 /etc/openqa/openqa.ini
  • DONE: enable job_done hooks again in o3 /etc/openqa/openqa.ini
  • DONE: start openqa-scheduler
  • DONE: ensure all incomplete jobs are handled
  • DONE: run auto-review manually, e.g. in https://gitlab.suse.de/openqa/auto-review/pipelines
  • DONE: crosscheck incomplete and failed jobs manually
  • DONE: start additional worker instances again: for i in aarch64 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "systemctl start default.target" ; done
  • DONE: set status on https://status.opensuse.org/dashboard back to Operational with result

Related issues

Copied to openQA Project - coordination #92854: [epic] limit overload of openQA webUI by heavy requests (New, 2021-06-12)

History

#1 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to Immediate

#2 Updated by okurz 5 months ago

It looks like o3 is actually reachable but very slow. Restarting openqa-webui on its own does not seem to help; restarting apache2 seems to have helped once. Maybe restarting apache2 and openqa-webui together helps better to flush occupied connections or something similar. As this was already reported yesterday evening at 2100Z by DimStar in https://matrix.to/#/!ZzdoIxkFyLxnxddLqM:matrix.org/$1621286006679174vbjTB:matrix.org the issue can't be explained by today's upgrade, maybe by yesterday's. Triggered a reboot of o3.

#3 Updated by okurz 5 months ago

The reboot was quick and without problems but did not help.

Trying a downgrade, going back to packages from May 15: zypper install --oldpackage /var/cache/zypp/packages/devel_openQA/noarch/openQA{-common,-client,-auto-update,}-4.6.1620996956.bd2066072-lp152.3983.1.noarch.rpm

Reported an incident on status.opensuse.org for o3, linking to this ticket.

Recently installed packages, upgrades, updates, patches:

# less /var/log/zypp/history  | sed -n '/2021-05-16/,$p' | grep '|install|'
2021-05-16 21:23:29|install|libdevmapper1_03|1.02.163-lp152.7.24.1|x86_64||repo-update-oss|8b7f809d50d975208b0eff5a56bd2190281f65eec283797850b932e139a5fa20|
2021-05-16 21:23:29|install|liblvm2cmd2_03|2.03.05-lp152.7.24.1|x86_64||repo-update-oss|9c507cce552555dd0fc0b318c667d1daf073fada1f1c5239047cc037428665ab|
2021-05-16 21:23:30|install|libnuma1|2.0.14-lp152.5.6.1|x86_64||repo-update-oss|89c860c7442ca52309fe48b85c77cffc3a8e4fac60867abe931d6b9641c6e52d|
2021-05-16 21:23:30|install|libsensors4|3.5.0-lp152.2.3.1|x86_64||repo-update-oss|5852fbe1cb8ae50a8ba31d0c4983381bdbdd00e22e8de7a4233d30d43ebbd372|
2021-05-16 21:23:30|install|libdevmapper-event1_03|1.02.163-lp152.7.24.1|x86_64||repo-update-oss|81f47df6e2f739c3851b1b90a36d52ce455e045573eec7ac6162139b407667f5|
2021-05-16 21:23:31|install|device-mapper|1.02.163-lp152.7.24.1|x86_64||repo-update-oss|e6c373d3d83098d2b355cf30041fc6039e6d20a9f9e0f0e45414c254e71d9859|
2021-05-16 21:23:32|install|lvm2|2.03.05-lp152.7.24.1|x86_64||repo-update-oss|c6674d0679821353eadc2c6a2a3ec3d86865c218c1727c921dec709155304292|
2021-05-17 00:00:33|install|autofs|5.1.3-lp152.9.3.1|x86_64||repo-update-oss|5acff9f63fce509605bc0b91016dd1c56b24a317d98c0697ef961d14a3676f3c|
2021-05-17 00:00:33|install|dracut|049.1+suse.188.gbf445638-lp152.2.27.1|x86_64||repo-update-oss|72611151e390229c82a701bddf2311061d5b4a76884ebfb9ec6ad5c4e5f51178|
2021-05-17 03:00:15|install|openQA-auto-update|4.6.1620996956.bd2066072-lp152.3984.1|noarch||devel_openQA|d0a8d0f5b7442951beb473356b9fe7bebfb9e747311db88f91c8c270295e2c46|
2021-05-17 03:00:16|install|openQA-common|4.6.1620996956.bd2066072-lp152.3984.1|noarch||devel_openQA|d83db738565a943ddcce4f34260a65cdf9da0eb7c53191cc9cc7a7d922b9f2c3|
2021-05-17 03:00:16|install|openQA-client|4.6.1620996956.bd2066072-lp152.3984.1|noarch||devel_openQA|25dfc4bc784652d57807fb0900531aa4fb9f3e41ebaee7d581525a5033d90429|
2021-05-17 03:00:23|install|openQA|4.6.1620996956.bd2066072-lp152.3984.1|noarch||devel_openQA|5ecd093611b051e28f04d5f8353f456618149e7a5dc4383f7998b0715537d403|
2021-05-17 03:00:23|install|sed|4.4-lp152.5.3.1|x86_64||repo-update-oss|67dd4bdf2076bffc42eae33088ff29a1d14d4bcb490f3865b246d4e9bedfe81f|
2021-05-17 03:00:23|install|sed-lang|4.4-lp152.5.3.1|noarch||repo-update-oss|498e664abf5096cc3eef7dd6d62554ff4d067a92fd9f844291d7070d3cb2dc1d|
2021-05-18 03:00:49|install|openQA-auto-update|4.6.1621240003.931b86d3f-lp152.3985.1|noarch||devel_openQA|a2fde8d1f04aaad95673f4fb0f5deedc90c75861287ba600ed64705e5b357547|
2021-05-18 03:00:50|install|openQA-common|4.6.1621240003.931b86d3f-lp152.3985.1|noarch||devel_openQA|4d84f0209533de135a86b885b52d662f62291d6653c6904626c0b2d87425e871|
2021-05-18 03:00:50|install|openQA-client|4.6.1621240003.931b86d3f-lp152.3985.1|noarch||devel_openQA|22e8fa0bd8780a2a86cbd58df6c8e722c0e57ce13bcbd49b1e2177f42ea8af7c|
2021-05-18 03:01:12|install|openQA|4.6.1621240003.931b86d3f-lp152.3985.1|noarch||devel_openQA|8bff8dfedf317d2f18651ea6c936eef0012b1d86ca1b635f15487607f1dd7afd|
2021-05-18 03:01:15|install|suse-online-update|2.2-lp152.2.1|noarch||openSUSE:infrastructure|3fbde818db9b7372217daf557b9f56a9f2503b1fce9374d0d8a6d6792a279d92|
2021-05-18 05:59:47|install|openQA-common|4.6.1620996956.bd2066072-lp152.3983.1|noarch|root@ariel|_tmpRPMcache_||
2021-05-18 05:59:50|install|openQA-auto-update|4.6.1620996956.bd2066072-lp152.3983.1|noarch|root@ariel|_tmpRPMcache_||
2021-05-18 05:59:51|install|openQA-client|4.6.1620996956.bd2066072-lp152.3983.1|noarch|root@ariel|_tmpRPMcache_||
2021-05-18 06:00:08|install|openQA|4.6.1620996956.bd2066072-lp152.3983.1|noarch|root@ariel|_tmpRPMcache_||
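
The sed/grep pipeline above can also be folded into a single awk filter over the |-separated history format; a sketch with sample lines standing in for the real /var/log/zypp/history:

```shell
# Print package name and version for install actions since a cutoff date.
# Fields in /var/log/zypp/history are |-separated:
# "date time|action|name|version|arch|..."; sample lines stand in for it.
history_sample='2021-05-15 10:00:00|install|foo|1.0|x86_64|
2021-05-17 03:00:16|install|openQA-common|4.6.1620996956|noarch|
2021-05-18 03:00:50|remove|bar|2.0|x86_64|'
printf '%s\n' "$history_sample" \
  | awk -F'|' '$1 >= "2021-05-16" && $2 == "install" {print $3, $4}'
```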

All 10 CPU cores are fully loaded, mostly with the main openqa processes. I temporarily disabled both "job done hooks" in o3 /etc/openqa/openqa.ini in case these add to the load significantly. Stopping apache2 makes these processes go back to zero CPU usage soon, a local curl http://localhost:9526 is very responsive. I stopped openqa-scheduler to not schedule more jobs and not add to the mess.
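
The "very responsive" observation can be made quantitative with curl's timing variables; a sketch against a throwaway local server (on o3 the real target would be http://localhost:9526, the prefork listener):

```shell
# Report HTTP status and total request time. A throwaway python http.server
# stands in here for the openQA prefork listener so the snippet is
# self-contained; port 8099 is arbitrary.
python3 -m http.server 8099 --bind 127.0.0.1 >/dev/null 2>&1 &
server_pid=$!
sleep 1
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' http://127.0.0.1:8099/
kill "$server_pid"
```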

#4 Updated by okurz 5 months ago

  • Description updated (diff)

#5 Updated by okurz 5 months ago

  • Description updated (diff)

Disabled most openqa worker instances to reduce the load

for i in aarch64 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "systemctl stop openqa-worker-auto-restart@\*" ; done

#6 Updated by okurz 5 months ago

  • Description updated (diff)

This is what strace output from the openQA webUI processes looks like:

strace: Process 30934 attached
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0;\0dbdpg_p30934_2\0\0\0\0\3\0\0\0\010202"..., 82, MSG_NOSIGNAL, NULL, 0) = 82
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\3\242\0\37id\0\0\0@\205\0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 24576
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "6_64-Build20210512-kde-live_inst"..., 32706, 0, NULL, NULL) = 32706
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "eap_15.2_gnome\0\0\0\10opensuse\0\0\0\nTu"..., 32605, 0, NULL, NULL) = 32605
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "Build20210512-gpt@64bit\0\0\0\1f\0\0\0\4"..., 32706, 0, NULL, NULL) = 15617
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\v\254\0dbdpg_p30934_1b\0\0\0\1\r\0\0\0\00717"..., 3011, MSG_NOSIGNAL, NULL, 0) = 3011
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\0\362\0\tid\0\0\0@*\0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 10560
getpid()                                = 30934
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0+\0dbdpg_p30934_1\0\0\0\0\1\0\0\0\vboo"..., 66, MSG_NOSIGNAL, NULL, 0) = 66
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\1X\0\rid\0\0\0@ \0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 598
getpid()                                = 30934
getpid()                                = 30934
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0)\0dbdpg_p30934_1\0\0\0\0\1\0\0\0\tpoo"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\1X\0\rid\0\0\0@ \0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 554
getpid()                                = 30934
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0+\0dbdpg_p30934_1\0\0\0\0\1\0\0\0\vboo"..., 66, MSG_NOSIGNAL, NULL, 0) = 66
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\1X\0\rid\0\0\0@ \0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 568
getpid()                                = 30934
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0)\0dbdpg_p30934_1\0\0\0\0\1\0\0\0\tpoo"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\1X\0\rid\0\0\0@ \0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 577
getpid()                                = 30934
getpid()                                = 30934
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0)\0dbdpg_p30934_1\0\0\0\0\1\0\0\0\tpoo"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\1X\0\rid\0\0\0@ \0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 595
getpid()                                = 30934
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0)\0dbdpg_p30934_1\0\0\0\0\1\0\0\0\tpoo"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\1X\0\rid\0\0\0@ \0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 565
getpid()                                = 30934
getpid()                                = 30934
getpid()                                = 30934
getpid()                                = 30934
sendto(13<socket:[747058]>, "B\0\0\0)\0dbdpg_p30934_1\0\0\0\0\1\0\0\0\tpoo"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
poll([{fd=13<socket:[747058]>, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13<socket:[747058]>, "2\0\0\0\4T\0\0\1X\0\rid\0\0\0@ \0\1\0\0\0\27\0\4\377\377\377\377\0"..., 32768, 0, NULL, NULL) = 572
getpid()                                = 30934
getpid()                                = 30934

I wonder where the many "sendto" calls with truncated "boo" and "poo" strings come from; it could be just the tests overview page.
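
A quick way to see where such a busy process spends its calls is to summarize the strace log by syscall name; sample lines stand in for the trace above:

```shell
# Count syscalls by name in an strace log by stripping everything from the
# first "(" onward, then tallying with sort/uniq.
trace_sample='getpid()                                = 30934
sendto(13, "...", 64, MSG_NOSIGNAL, NULL, 0) = 64
poll([{fd=13}], 1, -1) = 1
getpid()                                = 30934
recvfrom(13, "...", 32768, 0, NULL, NULL) = 598'
printf '%s\n' "$trace_sample" | sed 's/(.*//' | sort | uniq -c | sort -rn
```

Against a live process this would be strace -p PID -o trace.log, then the same pipeline on trace.log.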

I disabled AMQP in o3 /etc/openqa/openqa.ini assuming that maybe it's also problematic as it involves remote communication.

Disabled most content in /usr/share/openqa/templates/webapi/test/overview.html.ep

Jane suggested https://metacpan.org/pod/App%3a%3aStacktrace. It seems we do not have that package in openSUSE Leap, so I installed perl-App-cpanminus and gcc, ran cpanm --notest App::Stacktrace as okurz, and called sudo sh -c 'PERL5LIB=/home/okurz/perl5/lib/perl5 /home/okurz/perl5/bin/perl-stacktrace 7278', which aborted with "Cannot access memory at address 0xffffffffae9b5970", so that did not get far.

#7 Updated by okurz 5 months ago

  • Description updated (diff)

disabled /usr/share/openqa/templates/webapi/main/group_overview.html.ep as well

#9 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service to openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)[2021-05-18.*api failure: [0-9]* response: Request Timeout":retry
  • Description updated (diff)

Enabled all pages in openQA again, enabled amqp, enabled job done hooks, started openqa-scheduler.

Started one pipeline in https://gitlab.suse.de/openqa/auto-review/-/pipelines/140956 to handle the accumulated incomplete and failed.

#10 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)[2021-05-18.*api failure: [0-9]* response: Request Timeout":retry to openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)\[2021-05-18.*api failure: [0-9]* response: Request Timeout":retry

#11 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)\[2021-05-18.*api failure: [0-9]* response: Request Timeout":retry to openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-18.*api failure: [0-9]* response: Request Timeout":retry

#12 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-18.*api failure: [0-9]* response: Request Timeout":retry to openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-18.*api failure: [0-9]* response: (Request Timeout|timestamp mismatch)":retry

#13 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-18.*api failure: [0-9]* response: (Request Timeout|timestamp mismatch)":retry to openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-1[78].*api failure: [0-9]* response: (Request Timeout|timestamp mismatch)":retry

#14 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-1[78].*api failure: [0-9]* response: (Request Timeout|timestamp mismatch)":retry to openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-1[78].*api failure: [0-9]* response: (Request Timeout|timestamp mismatch|Proxy Error|Service Unavailable|Temporary failure in name resolution)":retry

#15 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-1[78].*api failure: [0-9]* response: (Request Timeout|timestamp mismatch|Proxy Error|Service Unavailable|Temporary failure in name resolution)":retry to openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-1[78].*api failure: ([0-9]* response: (Request Timeout|timestamp mismatch|Proxy Error|Service Unavailable)|Temporary failure in name resolution)":retry

#16 Updated by okurz 5 months ago

  • Description updated (diff)

Completed all rollback steps. We are left with a rewrite rule that excludes many users from working with the o3 webUI normally, which we need to improve. Also we should conduct a "Five Whys" analysis and come up with improvements, e.g. rate limiting on multiple levels.

ideas:

  • only allow certain routes or query parameters for logged in users
  • disable more costly query parameters by default, e.g. no comment, tag, label parsing
  • use external tools to enforce rate limiting, e.g. in the apache proxy
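
One concrete, hypothetical shape for the last idea is mod_evasive in the apache proxy; this is an untested sketch and all values are placeholders, not tuned for o3:

```apache
# Hypothetical sketch: per-client request rate limiting with mod_evasive.
# On openSUSE this would presumably be packaged as apache2-mod_evasive.
<IfModule mod_evasive24.c>
    DOSHashTableSize    3097
    DOSPageCount        20    # max requests for the same URI per interval
    DOSPageInterval     1     # page interval in seconds
    DOSSiteCount        100   # max total requests per client per interval
    DOSSiteInterval     1     # site interval in seconds
    DOSBlockingPeriod   10    # seconds a blocked client receives 403
</IfModule>
```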

#17 Updated by okurz 5 months ago

  • Priority changed from Immediate to Urgent

#18 Updated by openqa_review 5 months ago

  • Due date set to 2021-06-02

Setting due date based on mean cycle time of SUSE QE Tools

#19 Updated by okurz 5 months ago

  • Subject changed from openqa.opensuse.org down, o3 VM reachable, no failed service auto_review:"(?s)2021-05-1[78].*api failure: ([0-9]* response: (Request Timeout|timestamp mismatch|Proxy Error|Service Unavailable)|Temporary failure in name resolution)":retry to openqa.opensuse.org down, o3 VM reachable, no failed service
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

Disabled the block patterns again for now; looks ok so far, monitoring in htop. Removed the auto_review string auto_review:"(?s)2021-05-1[78].*api failure: ([0-9]* response: (Request Timeout|timestamp mismatch|Proxy Error|Service Unavailable)|Temporary failure in name resolution)":retry from the subject to prevent overly heavy grep calls (as seen on o3 in htop as well).
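
For reference, such (?s) patterns force the regex engine to scan the whole log as one chunk, which is what makes these grep calls heavy; a minimal reproduction of a cross-line match with GNU grep, where -z treats the input as a single NUL-terminated record:

```shell
# (?s) lets "." match newlines, and -z makes GNU grep treat the whole input
# as one record, so the pattern can span lines. Sample log lines stand in
# for a real worker autoinst-log.
log_sample='[2021-05-18T05:00:00] worker connecting
[2021-05-18T05:00:07] api failure: 408 response: Request Timeout'
printf '%s' "$log_sample" \
  | grep -qzP '(?s)2021-05-18.*api failure: [0-9]* response: Request Timeout' \
  && echo 'pattern matches'
```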

#20 Updated by okurz 5 months ago

#21 Updated by favogt 5 months ago

Happening again today. I re-enabled the block; the situation is normal again.

#24 Updated by okurz 5 months ago

  • Private changed from Yes to No

#25 Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

I guess we will keep it as is and follow up in #92854 with, hopefully, a better solution.
