action #178642
coordination #127031: [saga][epic] openQA for SUSE customers
coordination #138365: [epic] openQA works in SELinux enforced environments
openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"retry.*zypper.*ref && zypper --no-cd -n in openQA-worker.*timed out" size:S
Description
Observation
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_from_bootstrap@64bit-2G fails in
openqa_webui. It seems like http://localhost never comes up and the web UI within the test is not running:
# Test died: command 'skip_suse_specifics=1 skip_suse_tests=1 /usr/share/openqa/script/openqa-bootstrap' timed out at /usr/lib/os-autoinst/autotest.pm line 416.
There's another case which looks a little different, but seems to be a symptom of the same issue:
https://openqa.opensuse.org/tests/4914440#step/dashboard/6
Acceptance Criteria
- AC1: openQA in openQA tests pass reliably
- AC2: Tests in GitHub pull requests pass reliably
Reproducible
Fails since (at least) Build :TW.35346
Expected result
Last good: :TW.35345 (or more recent)
Further details
Always latest result in this scenario: latest
Suggestions
- Ask in #help-mirrorcache
- Consider not using MirrorCache anymore?
- Make these tests optional for deployment?
Updated by livdywan 20 days ago
- Subject changed from openqa-bootstrap times out in openqa_webui test in openQA to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to livdywan
- Priority changed from High to Urgent
Raising priority as it's also affecting PRs. Though we probably can't do more than wait for the MirrorCache configuration to be fixed.
Updated by openqa_review 19 days ago
- Due date set to 2025-03-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 19 days ago
Discussed this in the unblock:
- Add an autoreview regex
- Add a curl call to openqa_webui after systemctl status
- we do this as part of bootstrap c.f. https://openqa.opensuse.org/tests/4917199#step/openqa_webui/7
- we "successfully" use curl to print a 503 error c.f. https://openqa.opensuse.org/tests/4917202#step/openqa_worker/3
- use curl --fail or curl --fail-with-body or even better curl -w "\n%{http_code}\n"
- Does systemd have a concept of health checks? Could we add it there?
- Timestamps don't match? But authentication should still work, so never mind?
- Is "No such timeout policy ovs_test_tp" relevant? Double-check if this occurs in passing jobs
- In some cases package downloads are failing temporarily but successfully retried, which is only visible on video e.g. https://openqa.opensuse.org/tests/4917202/video?filename=video.webm
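For reference, a minimal sketch of the curl idea from the list above, assuming the web UI is expected at http://localhost. The function names is_healthy_status and check_webui are made up for illustration and are not part of the test distribution. It prints the HTTP status via -w and returns non-zero on anything other than 2xx/3xx, so a 503 page can no longer pass as "success".

```shell
#!/bin/sh
# Hypothetical helper: decide whether an HTTP status code counts as "web UI is up".
is_healthy_status() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: probe a URL and fail on anything but 2xx/3xx.
# -s silences progress output, -o /dev/null drops the body,
# -w '%{http_code}' prints only the status code (000 if no response at all).
check_webui() {
    url="${1:-http://localhost}"
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    if is_healthy_status "$code"; then
        echo "web UI reachable at $url (HTTP $code)"
    else
        echo "web UI NOT healthy at $url (HTTP ${code:-none})" >&2
        return 1
    fi
}
```

Calling check_webui right after systemctl status would make the step fail immediately on a 503 instead of running into the later timeout.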
Also:
https://openqa.opensuse.org/tests/4917202/logfile?filename=start_test-journal.log.txt#line-5402
Mar 12 06:02:29 susetest systemd[1]: Starting openQA Worker #1...
Mar 12 06:02:29 susetest systemd[1]: Started openQA Worker #1.
Mar 12 06:02:29 susetest dns-dnsmasq.sh[12146]: <debug> NETWORKMANAGER_DNS_FORWARDER is not set to "dnsmasq" in /etc/sysconfig/network/config -> exit
Mar 12 06:02:29 susetest ovs-vsctl[12150]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set bridge up rstp_enable=true
Mar 12 06:02:29 susetest ovs-vsctl[12150]: ovs|00002|db_ctl_base|ERR|no row "up" in table Bridge
Mar 12 06:02:29 susetest nm-dispatcher[12150]: ovs-vsctl: no row "up" in table Bridge
Mar 12 06:02:29 susetest nm-dispatcher[8253]: req:267 'up' [tap82], "/etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh": complete: process failed with Script '/etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh' exited with status 1
Mar 12 06:02:29 susetest NetworkManager[8255]: <warn> [1741773749.9050] dispatcher: (266) /etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh failed (failed): Script '/etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh' exited with status 1
Mar 12 06:02:30 susetest systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Mar 12 06:02:31 susetest worker[12142]: [info] worker 1:
Mar 12 06:02:31 susetest worker[12142]: - config file: /etc/openqa/workers.ini
Mar 12 06:02:31 susetest worker[12142]: - name used to register: susetest
Mar 12 06:02:31 susetest worker[12142]: - worker address (WORKER_HOSTNAME): localhost
Mar 12 06:02:31 susetest worker[12142]: - isotovideo version: 43
Mar 12 06:02:31 susetest worker[12142]: - websocket API version: 1
Mar 12 06:02:31 susetest worker[12142]: - web UI hosts: localhost
Mar 12 06:02:31 susetest worker[12142]: - class: ?
Mar 12 06:02:31 susetest worker[12142]: - no cleanup: no
Mar 12 06:02:31 susetest worker[12142]: - pool directory: /var/lib/openqa/pool/1
Mar 12 06:02:31 susetest worker[12142]: [info] Project dir for host localhost is /var/lib/openqa/share
Mar 12 06:02:31 susetest worker[12142]: [info] Registering with openQA localhost
Mar 12 06:02:31 susetest worker[12142]: [warn] Failed to register at localhost - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Mar 12 06:02:31 susetest worker[12142]: <html><head>
Mar 12 06:02:31 susetest worker[12142]: <title>503 Service Unavailable</title>
Mar 12 06:02:31 susetest worker[12142]: </head><body>
Mar 12 06:02:31 susetest worker[12142]: <h1>Service Unavailable</h1>
Mar 12 06:02:31 susetest worker[12142]: <p>The server is temporarily unable to service your
Mar 12 06:02:31 susetest worker[12142]: request due to maintenance downtime or capacity
Mar 12 06:02:31 susetest worker[12142]: problems. Please try again later.</p>
Mar 12 06:02:31 susetest worker[12142]: <p>Additionally, a 503 Service Unavailable
Mar 12 06:02:31 susetest worker[12142]: error was encountered while trying to use an ErrorDocument to handle the request.</p>
Mar 12 06:02:31 susetest worker[12142]: <hr>
Mar 12 06:02:31 susetest worker[12142]: <address>Apache Server at localhost Port 80</address>
Mar 12 06:02:31 susetest worker[12142]: </body></html>
Mar 12 06:02:31 susetest worker[12142]: - trying again in 10 seconds
Next steps:
- Follow suggestions to try and make tests fail faster when the web UI is not working (using curl)
- Mitigate via autoregex or if needed by switching off email notifications temporarily
Updated by szarate 19 days ago · Edited
I wonder if this is more about SELinux... https://openqa.opensuse.org/tests/4917476#step/dashboard/7 shows the same error that I'm seeing on my Tumbleweed installation of openQA (after updating just today),
and the logs are showing constant denials from SELinux:
ket permissive=0
type=AVC msg=audit(1741786133.379:949): avc: denied { name_connect } for pid=3901 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
type=AVC msg=audit(1741786133.379:950): avc: denied { name_connect } for pid=16186 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
type=AVC msg=audit(1741786133.379:951): avc: denied { name_connect } for pid=16186 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
type=AVC msg=audit(1741786133.379:952): avc: denied { name_connect } for pid=3901 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
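To get a quick overview of denials like the ones above, something along these lines might help. This is only a sketch: summarize_avc_denials is a made-up name, and the sed expression assumes the field order seen in the lines quoted here (comm, dest, scontext, tcontext).

```shell
#!/bin/sh
# Hypothetical helper: read audit log lines on stdin and summarize denied
# name_connect AVCs as "<count>x <comm> -> port <dest> (<target context>)".
summarize_avc_denials() {
    grep 'avc: *denied' \
        | grep 'name_connect' \
        | sed -n 's/.*comm="\([^"]*\)".* dest=\([0-9]*\) .* tcontext=\([^ ]*\) .*/\1 -> port \2 (\3)/p' \
        | awk '{ count[$0]++ } END { for (k in count) print count[k] "x " k }' \
        | sort
}
```

On a system with auditd this could be fed from the audit log, e.g. ausearch -m avc -ts recent | summarize_avc_denials. For the lines above it would show httpd-prefork being denied connections to port 9526.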
Updated by livdywan 18 days ago
- Tags changed from alert, reactive work to alert, reactive work, infra
- Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN size:S to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard" size:S
Updated by livdywan 18 days ago
- Priority changed from Urgent to High
- Follow suggestions to try and make tests fail faster when the web UI is not working (using curl)
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/226
- Mitigate via autoregex or if needed by switching off email notifications temporarily
auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard"
Updated by tinita 18 days ago
- Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard" size:S to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard|--fail-with-body.*failed" size:S
Added --fail-with-body.*failed to auto_review
Updated by tinita 18 days ago
- Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard|--fail-with-body.*failed" size:S to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S
Added zypper.*ref.*failed to auto_review. Had to shorten the title elsewhere as we reached the maximum title length :)
Updated by tinita 18 days ago
- Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S to openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S
Updated by livdywan 18 days ago
livdywan wrote in #note-12:
- Follow suggestions to try and make tests fail faster when the web UI is not working (using curl)
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/226
Also https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/227. This is why we still need to match the failing needle in the auto_review regex.
Updated by okurz 18 days ago
- Copied to action #178822: openQA in openQA tests failing with unreachable webUI, possibly due to SELinux size:S added
Updated by okurz 18 days ago
- Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S to openQA in openQA tests failing with unreachable webUI, possibly due to SELinux
- Due date deleted (2025-03-26)
- Status changed from In Progress to New
- Assignee deleted (livdywan)
- Start date deleted (2025-03-11)
I created a separate ticket for the SELinux cause #178822
Updated by okurz 18 days ago
- Subject changed from openQA in openQA tests failing with unreachable webUI, possibly due to SELinux to openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S
- Due date set to 2025-03-26
- Status changed from New to In Progress
- Assignee set to livdywan
- Start date set to 2025-03-11
Updated by livdywan 17 days ago · Edited
Going through current runs right now to see which failures are caused by MirrorCache and which may not be, and to refine the auto_review regex.
This looks like package installation was aborted after retries ran out https://openqa.opensuse.org/tests/4920786
# Test died: command 'retry -e -s 30 -r 7 -- sh -c "zypper -n --gpg-auto-import-keys ref && zypper --no-cd -n in openQA-worker"' timed out at /var/lib/openqa/pool/24/os-autoinst-distri-openQA/tests/install/openqa_worker.pm line 7.
https://openqa.opensuse.org/tests/4920784#step/openqa_worker/3 Same
Edit: I think all the cases now are mirror-related, and I'm checking with upstream again. The regex "retry.*zypper.*ref && zypper --no-cd -n in openQA-worker.*timed out" should cover all of these.
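A quick way to sanity-check a candidate regex against a failure message before putting it into the subject. This is a sketch only: matches_auto_review is a made-up helper, and grep -E is used as an approximation, so the real auto_review matcher may behave differently for more exotic patterns.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the failure text matches the candidate pattern.
# -e keeps grep from misreading a pattern that starts with a dash as an option.
matches_auto_review() {
    pattern="$1"
    text="$2"
    printf '%s\n' "$text" | grep -Eq -e "$pattern"
}
```

For example, passing the regex above together with the "Test died" line from https://openqa.opensuse.org/tests/4920786 returns success, while a plain "openqa-bootstrap timed out" message does not match anymore.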
Updated by livdywan 17 days ago
- Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S to openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"retry.*zypper.*ref && zypper --no-cd -n in openQA-worker.*timed out" size:S
Updated by livdywan 12 days ago
- Copied to action #179131: Tests failing with zypper error about package corrupted during transfer auto_review:"Package perl-DBIx-Class-DeploymentHandler.*seems to be corrupted during transfer" added
Updated by livdywan 6 days ago
Maybe we just need to add ?PEDANTIC=1 to the repo URL somehow. That will signal the redirector that it should check the file on mirrors before trying to redirect.
Alternatively, we can use downloadcontent.opensuse.org instead of download.opensuse.org. That will access the real files directly and never try to redirect to mirrors. We usually try to avoid it, because it is very easy to overload the storage if everyone attempts to use it.
So the suggestion is to avoid syncing packages that get created and deleted periodically.
Updated by livdywan 4 days ago
- Due date changed from 2025-03-26 to 2025-03-28
Unfortunately I could not find out where to make those changes and did not get any suggestions in the daily.
Pointers appreciated.
Updated by livdywan 3 days ago
livdywan wrote in #note-31:
Let's see if this works:
openqa-clone-job --repeat 1 --within-instance https://openqa.opensuse.org/tests/4952887 _GROUP=0 BUILD+=poo#178642 OPENQA_REPO_URL="obs://devel:openQA/openSUSE_Tumbleweed?PEDANTIC=1"
Apparently the ? gets encoded when used as part of an obs:// URL. I couldn't find any docs for this scheme, so I'm going for the verbose version now.
zypper rr openQA; zypper -n ar -p 95 -f 'https://download.opensuse.org/repositories/devel:openQA/openSUSE_Tumbleweed?PEDANTIC=1' openQA; zypper ref openQA
This seems to work: https://openqa.opensuse.org/tests/4953020
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/233
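For illustration, the two options discussed so far (?PEDANTIC=1 vs. downloadcontent.opensuse.org) could be expressed like this. obs_repo_url and its mode argument are hypothetical, and it is an assumption that downloadcontent.opensuse.org uses the same /repositories/ path layout as download.opensuse.org.

```shell
#!/bin/sh
# Hypothetical helper: build a download URL for an OBS project/repository pair.
# mode "pedantic" appends ?PEDANTIC=1 (redirector checks mirrors first),
# mode "direct" switches to downloadcontent.opensuse.org (no mirror redirects),
# anything else yields the plain download.opensuse.org URL.
obs_repo_url() {
    project="$1"
    repo="$2"
    mode="${3:-default}"
    host="download.opensuse.org"
    query=""
    case "$mode" in
        pedantic) query="?PEDANTIC=1" ;;
        direct)   host="downloadcontent.opensuse.org" ;;
    esac
    printf 'https://%s/repositories/%s/%s%s\n' "$host" "$project" "$repo" "$query"
}
```

Usage would then look like: zypper -n ar -p 95 -f "$(obs_repo_url devel:openQA openSUSE_Tumbleweed pedantic)" openQA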
Updated by livdywan 3 days ago
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/233
Discussing this further with Andrii:
- There's an inherent race condition between adding the repo and installing packages, unless we can guarantee nothing is deleted while the test is running.
- Actually not deleting packages in the repo would be preferable and more in line with how download.opensuse.org is designed to work.
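To illustrate the race: refreshing and installing as one retried unit, roughly what the existing retry ... sh -c "zypper ref && zypper in ..." call does, narrows the window but cannot close it, since packages can still disappear between the two steps. A sketch with a made-up helper name, not taken from the test code:

```shell
#!/bin/sh
# Hypothetical helper: run "refresh" then "install" as one unit and retry both
# together, so each new attempt starts from fresh repository metadata.
# Arguments: attempts, delay in seconds, refresh command, install command.
refresh_and_install() {
    attempts="$1"
    delay="$2"
    refresh_cmd="$3"
    install_cmd="$4"
    n=1
    while [ "$n" -le "$attempts" ]; do
        if sh -c "$refresh_cmd" && sh -c "$install_cmd"; then
            return 0
        fi
        echo "attempt $n/$attempts failed" >&2
        n=$((n + 1))
        # Only wait if another attempt follows.
        [ "$n" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

Example: refresh_and_install 7 30 "zypper -n --gpg-auto-import-keys ref" "zypper --no-cd -n in openQA-worker"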
Updated by livdywan 3 days ago
- Related to action #167395: Ensure only the tested revision of devel:openQA packages are submitted to openSUSE:Factory size:M added
Updated by livdywan 3 days ago
livdywan wrote in #note-32:
livdywan wrote in #note-31:
Let's see if this works:
openqa-clone-job --repeat 1 --within-instance https://openqa.opensuse.org/tests/4952887 _GROUP=0 BUILD+=poo#178642 OPENQA_REPO_URL="obs://devel:openQA/openSUSE_Tumbleweed?PEDANTIC=1"
Apparently the ? gets encoded when used as part of an obs:// URL. I couldn't find any docs for this scheme, so I'm going for the verbose version now.
Not a blocker, but for reference I reported this issue upstream.
Updated by livdywan 2 days ago
https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/233
Review ongoing. Maybe we need to discuss in more detail next week.