action #178642

open

coordination #127031: [saga][epic] openQA for SUSE customers

coordination #138365: [epic] openQA works in SELinux enforced environments

openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"retry.*zypper.*ref && zypper --no-cd -n in openQA-worker.*timed out" size:S

Added by livdywan 20 days ago. Updated 2 days ago.

Status: Feedback
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-03-11
Due date: 2025-04-04 (Due in 4 days)
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_from_bootstrap@64bit-2G fails in
openqa_webui. It seems like http://localhost never comes up and the web UI within the test is not running:

# Test died: command 'skip_suse_specifics=1 skip_suse_tests=1 /usr/share/openqa/script/openqa-bootstrap' timed out at /usr/lib/os-autoinst/autotest.pm line 416.

There's another case which looks a little different, but seems to be a symptom of the same issue:

https://openqa.opensuse.org/tests/4914440#step/dashboard/6

Acceptance Criteria

  • AC1: openQA in openQA tests pass reliably
  • AC2: Tests in GitHub pull requests pass reliably

Reproducible

Fails since (at least) Build :TW.35346

Expected result

Last good: :TW.35345 (or more recent)

Further details

Always latest result in this scenario: latest

Suggestions


Related issues 3 (0 open, 3 closed)

Related to openQA Project (public) - action #167395: Ensure only the tested revision of devel:openQA packages are submitted to openSUSE:Factory size:M (Resolved, mkittler, 2024-09-25)
Copied to openQA Project (public) - action #178822: openQA in openQA tests failing with unreachable webUI, possibly due to SELinux size:S (Resolved, dheidler, 2025-04-02)
Copied to openQA Project (public) - action #179131: Tests failing with zypper error about package corrupted during transfer auto_review:"Package perl-DBIx-Class-DeploymentHandler.*seems to be corrupted during transfer" (Resolved, okurz)

Actions #1

Updated by livdywan 20 days ago

  • Tags changed from alert, reactive work to alert, reactive work, infra
Actions #2

Updated by livdywan 20 days ago

This appears to be caused by MirrorCache issues.

Actions #3

Updated by livdywan 20 days ago

  • Subject changed from openqa-bootstrap times out in openqa_webui test in openQA to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN
  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to livdywan
  • Priority changed from High to Urgent

Raising priority as it's also affecting PRs, though we can probably do no more than wait for the MirrorCache configuration to be fixed.

Actions #4

Updated by livdywan 20 days ago

Asking for an update. Wondering if we can mitigate this if it takes longer to resolve.

Actions #5

Updated by livdywan 20 days ago

  • Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN size:S
  • Description updated (diff)
Actions #6

Updated by openqa_review 19 days ago

  • Due date set to 2025-03-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by livdywan 19 days ago

Discussed this in the unblock:

Also:

https://openqa.opensuse.org/tests/4917202/logfile?filename=start_test-journal.log.txt#line-5402

Mar 12 06:02:29 susetest systemd[1]: Starting openQA Worker #1...
Mar 12 06:02:29 susetest systemd[1]: Started openQA Worker #1.
Mar 12 06:02:29 susetest dns-dnsmasq.sh[12146]: <debug> NETWORKMANAGER_DNS_FORWARDER is not set to "dnsmasq" in /etc/sysconfig/network/config -> exit
Mar 12 06:02:29 susetest ovs-vsctl[12150]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set bridge up rstp_enable=true
Mar 12 06:02:29 susetest ovs-vsctl[12150]: ovs|00002|db_ctl_base|ERR|no row "up" in table Bridge
Mar 12 06:02:29 susetest nm-dispatcher[12150]: ovs-vsctl: no row "up" in table Bridge
Mar 12 06:02:29 susetest nm-dispatcher[8253]: req:267 'up' [tap82], "/etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh": complete: process failed with Script '/etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh' exited with status 1
Mar 12 06:02:29 susetest NetworkManager[8255]: <warn>  [1741773749.9050] dispatcher: (266) /etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh failed (failed): Script '/etc/NetworkManager/dispatcher.d/gre_tunnel_preup.sh' exited with status 1
Mar 12 06:02:30 susetest systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Mar 12 06:02:31 susetest worker[12142]: [info] worker 1:
Mar 12 06:02:31 susetest worker[12142]:  - config file:                      /etc/openqa/workers.ini
Mar 12 06:02:31 susetest worker[12142]:  - name used to register:            susetest
Mar 12 06:02:31 susetest worker[12142]:  - worker address (WORKER_HOSTNAME): localhost
Mar 12 06:02:31 susetest worker[12142]:  - isotovideo version:               43
Mar 12 06:02:31 susetest worker[12142]:  - websocket API version:            1
Mar 12 06:02:31 susetest worker[12142]:  - web UI hosts:                     localhost
Mar 12 06:02:31 susetest worker[12142]:  - class:                            ?
Mar 12 06:02:31 susetest worker[12142]:  - no cleanup:                       no
Mar 12 06:02:31 susetest worker[12142]:  - pool directory:                   /var/lib/openqa/pool/1
Mar 12 06:02:31 susetest worker[12142]: [info] Project dir for host localhost is /var/lib/openqa/share
Mar 12 06:02:31 susetest worker[12142]: [info] Registering with openQA localhost
Mar 12 06:02:31 susetest worker[12142]: [warn] Failed to register at localhost - 503 response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Mar 12 06:02:31 susetest worker[12142]: <html><head>
Mar 12 06:02:31 susetest worker[12142]: <title>503 Service Unavailable</title>
Mar 12 06:02:31 susetest worker[12142]: </head><body>
Mar 12 06:02:31 susetest worker[12142]: <h1>Service Unavailable</h1>
Mar 12 06:02:31 susetest worker[12142]: <p>The server is temporarily unable to service your
Mar 12 06:02:31 susetest worker[12142]: request due to maintenance downtime or capacity
Mar 12 06:02:31 susetest worker[12142]: problems. Please try again later.</p>
Mar 12 06:02:31 susetest worker[12142]: <p>Additionally, a 503 Service Unavailable
Mar 12 06:02:31 susetest worker[12142]: error was encountered while trying to use an ErrorDocument to handle the request.</p>
Mar 12 06:02:31 susetest worker[12142]: <hr>
Mar 12 06:02:31 susetest worker[12142]: <address>Apache Server at localhost Port 80</address>
Mar 12 06:02:31 susetest worker[12142]: </body></html>
Mar 12 06:02:31 susetest worker[12142]:  - trying again in 10 seconds

Next steps:

  • Follow suggestions to try and make tests fail faster when the web UI is not working (using curl)
  • Mitigate via autoregex or if needed by switching off email notifications temporarily
Actions #8

Updated by szarate 19 days ago · Edited

I wonder if this is more about SELinux... https://openqa.opensuse.org/tests/4917476#step/dashboard/7 shows the same error that I'm getting on my Tumbleweed installation of openQA (after updating just today),

and the logs show constant SELinux denials:

ket permissive=0
type=AVC msg=audit(1741786133.379:949): avc:  denied  { name_connect } for  pid=3901 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
type=AVC msg=audit(1741786133.379:950): avc:  denied  { name_connect } for  pid=16186 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
type=AVC msg=audit(1741786133.379:951): avc:  denied  { name_connect } for  pid=16186 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
type=AVC msg=audit(1741786133.379:952): avc:  denied  { name_connect } for  pid=3901 comm="httpd-prefork" dest=9526 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:openqa_port_t:s0 tclass=tcp_socket permissive=0
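
To see at a glance what is being denied, the AVC records can be grouped by permission, command, destination port and target context. A small Python sketch, assuming audit lines in the format shown above; `summarize_denials` is a hypothetical helper, not an existing tool:

```python
import re
from collections import Counter

# Matches the fields of interest in an AVC denial record like the ones above.
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{ (?P<perm>[^}]+) \}.*?comm="(?P<comm>[^"]+)"'
    r".*?dest=(?P<dest>\d+).*?tcontext=(?P<tcontext>\S+)"
)


def summarize_denials(lines):
    """Count denials per (permission, command, destination port, target context)."""
    counts = Counter()
    for line in lines:
        match = AVC_RE.search(line)
        if match:
            key = (match["perm"].strip(), match["comm"], int(match["dest"]), match["tcontext"])
            counts[key] += 1
    return counts
```

For the four complete records above this yields a single key: `name_connect` by `httpd-prefork` to port 9526 with target context `openqa_port_t`. That would be consistent with Apache being blocked from connecting to the service behind it, though the log excerpt alone does not prove it.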

Actions #9

Updated by okurz 18 days ago

  • Tags changed from alert, reactive work, infra to alert, reactive work

Actually not related to our QE infra, removed tag "infra".

Actions #10

Updated by livdywan 18 days ago

  • Tags changed from alert, reactive work to alert, reactive work, infra
  • Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN size:S to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard" size:S
Actions #11

Updated by livdywan 18 days ago

  • Tags changed from alert, reactive work, infra to alert, reactive work
Actions #12

Updated by livdywan 18 days ago

  • Priority changed from Urgent to High
  • Follow suggestions to try and make tests fail faster when the web UI is not working (using curl)

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/226

  • Mitigate via autoregex or if needed by switching off email notifications temporarily
auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard"
Actions #13

Updated by tinita 18 days ago

  • Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard" size:S to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard|--fail-with-body.*failed" size:S

Added --fail-with-body.*failed to auto_review

Actions #14

Updated by tinita 18 days ago

  • Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboard|--fail-with-body.*failed" size:S to openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S

Added zypper.*ref.*failed to auto_review. Had to shorten the title elsewhere as we reached the maximum title length :)

Actions #15

Updated by tinita 18 days ago

  • Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to missbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S to openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S
Actions #16

Updated by livdywan 18 days ago

livdywan wrote in #note-12:

  • Follow suggestions to try and make tests fail faster when the web UI is not working (using curl)

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/226

Also https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/227 this is why we still need to match the failing needle in the auto_review regex

Actions #17

Updated by tinita 18 days ago

@szarate suggests trying out enforcing=0 as a kernel boot parameter.
Who would be able to try this out?

Actions #18

Updated by okurz 18 days ago

  • Copied to action #178822: openQA in openQA tests failing with unreachable webUI, possibly due to SELinux size:S added
Actions #19

Updated by okurz 18 days ago

  • Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S to openQA in openQA tests failing with unreachable webUI, possibly due to SELinux
  • Due date deleted (2025-03-26)
  • Status changed from In Progress to New
  • Assignee deleted (livdywan)
  • Start date deleted (2025-03-11)

I created a separate ticket for the SELinux cause #178822

Actions #20

Updated by okurz 18 days ago

  • Subject changed from openQA in openQA tests failing with unreachable webUI, possibly due to SELinux to openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S
  • Due date set to 2025-03-26
  • Status changed from New to In Progress
  • Assignee set to livdywan
  • Start date set to 2025-03-11
Actions #21

Updated by okurz 18 days ago

  • Parent task set to #138365
Actions #22

Updated by okurz 18 days ago

  • Project changed from openQA Tests (public) to openQA Project (public)
  • Category changed from Bugs in existing tests to Regressions/Crashes
Actions #23

Updated by livdywan 17 days ago · Edited

Going through the current runs right now to see which failures are MirrorCache-related and which may not be, and to refine the auto_review regex.

This looks like package installation was aborted after the retries ran out: https://openqa.opensuse.org/tests/4920786

# Test died: command 'retry -e -s 30 -r 7 -- sh -c "zypper -n --gpg-auto-import-keys ref && zypper --no-cd -n in openQA-worker"' timed out at /var/lib/openqa/pool/24/os-autoinst-distri-openQA/tests/install/openqa_worker.pm line 7.

https://openqa.opensuse.org/tests/4920784#step/openqa_worker/3 Same

Edit: I think all the cases now are mirror-related, and I'm checking with upstream again. retry.*zypper.*ref && zypper --no-cd -n in openQA-worker.*timed out should cover all of these.
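
A quick way to sanity-check the proposed expression against the failure messages quoted in this ticket. This is a sketch using Python's `re` module; the auto_review tooling may apply the pattern with a different regex engine or against different text, so treat it only as a plausibility check:

```python
import re

# Pattern proposed above for the auto_review field of the ticket subject.
AUTO_REVIEW = r"retry.*zypper.*ref && zypper --no-cd -n in openQA-worker.*timed out"


def matches_auto_review(text: str, pattern: str = AUTO_REVIEW) -> bool:
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None
```

The `openqa_worker` timeout quoted above matches, while the earlier `openqa-bootstrap` timeout from the description does not, because it lacks the `retry ... zypper` command.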

Actions #24

Updated by livdywan 17 days ago

  • Subject changed from openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"openqa-bootstrap.*timed out|openqa-cli schedule.*failed|no candidate needle.*openqa-dashboa|--fail-with-body.*failed|zypper.*ref.*failed" size:S to openQA in openQA tests failing with 503 errors and timeouts due to misbehaving MirrorCache / CDN auto_review:"retry.*zypper.*ref && zypper --no-cd -n in openQA-worker.*timed out" size:S
Actions #25

Updated by livdywan 12 days ago

  • Copied to action #179131: Tests failing with zypper error about package corrupted during transfer auto_review:"Package perl-DBIx-Class-DeploymentHandler.*seems to be corrupted during transfer" added
Actions #26

Updated by livdywan 12 days ago

  • Status changed from In Progress to Blocked

Currently only cases of #179131 seem to be failing, hence blocking on that. To be revisited afterwards.

Actions #27

Updated by okurz 12 days ago

  • Status changed from Blocked to Workable

Please turn the suggestions anikitin gave in the Slack conversation you started into action items, e.g. circumventing mirrors for devel:openQA:tested, where we frequently remove packages again.

Actions #28

Updated by livdywan 6 days ago

Maybe we just need to add ?PEDANTIC=1 to the repo URL somehow. That will signal the redirector that it should check the file on mirrors before trying to redirect.
Alternatively, we can use downloadcontent.opensuse.org instead of download.opensuse.org. That will access the real files directly and never try to redirect to mirrors. We usually try to avoid it, because it is very easy to overload the storage if everyone attempts to use it.

So the suggestion is to avoid syncing packages that get created and deleted periodically.
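
Both options from the quote amount to small URL rewrites. A minimal Python sketch, assuming the repository is addressed by a plain https URL; the helper names `with_pedantic` and `via_downloadcontent` are hypothetical and not part of any openQA or MirrorCache tooling:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_pedantic(url: str) -> str:
    """Return url with PEDANTIC=1 added to (or overriding in) its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["PEDANTIC"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


def via_downloadcontent(url: str) -> str:
    """Swap download.opensuse.org for downloadcontent.opensuse.org, else return url unchanged."""
    parts = urlsplit(url)
    if parts.hostname != "download.opensuse.org":
        return url
    return urlunsplit(parts._replace(netloc="downloadcontent.opensuse.org"))
```

Given the caveat in the quote about overloading the storage, the second helper is shown only for completeness.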

Actions #29

Updated by livdywan 4 days ago

  • Due date changed from 2025-03-26 to 2025-03-28

Maybe we just need to add ?PEDANTIC=1 to the repo URL somehow. That will signal the redirector that it should check the file on mirrors before trying to redirect.
Alternatively, we can use downloadcontent.opensuse.org instead of download.opensuse.org. That will access the real files directly and never try to redirect to mirrors. We usually try to avoid it, because it is very easy to overload the storage if everyone attempts to use it.

So the suggestion is to avoid syncing packages that get created and deleted periodically.

Unfortunately I could not find out where to make those changes and did not get any suggestions in the daily.

Pointers appreciated.

Actions #31

Updated by livdywan 4 days ago · Edited

Let's see if this works:

openqa-clone-job --repeat 1 --within-instance https://openqa.opensuse.org/tests/4952887 _GROUP=0 BUILD+=poo#178642 OPENQA_REPO_URL="obs://devel:openQA/openSUSE_Tumbleweed?PEDANTIC=1"

https://openqa.opensuse.org/tests/4952937

Actions #32

Updated by livdywan 3 days ago

livdywan wrote in #note-31:

Let's see if this works:

openqa-clone-job --repeat 1 --within-instance https://openqa.opensuse.org/tests/4952887 _GROUP=0 BUILD+=poo#178642 OPENQA_REPO_URL="obs://devel:openQA/openSUSE_Tumbleweed?PEDANTIC=1"

Apparently the ? gets encoded when used as part of an obs:// URL. I couldn't find any docs for this scheme, so I'm going for the verbose version now.

zypper rr openQA; zypper -n ar -p 95 -f 'https://download.opensuse.org/repositories/devel:openQA/openSUSE_Tumbleweed?PEDANTIC=1' openQA; zypper ref openQA

This seems to work: https://openqa.opensuse.org/tests/4953020

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/233
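
One plausible reading of the `?` being encoded, offered only as an illustration because the obs:// expansion logic is not shown in this ticket: if everything after `obs://` is percent-encoded as a path before being turned into a download URL, the `?` becomes `%3F` and the server no longer sees a query string. The function `naive_obs_to_https` below is hypothetical and only mimics that assumed behaviour:

```python
from urllib.parse import quote, urlsplit

BASE = "https://download.opensuse.org/repositories/"


def naive_obs_to_https(obs_url: str) -> str:
    """Hypothetical expansion that percent-encodes everything after obs:// as a path."""
    prefix = "obs://"
    if not obs_url.startswith(prefix):
        raise ValueError("not an obs:// URL")
    project_and_repo = obs_url[len(prefix):]
    # '/' and ':' are kept, but '?' and '=' are escaped like any other path character
    return BASE + quote(project_and_repo, safe="/:")


def query_of(url: str) -> str:
    """Return the query string a server would see for url (empty if none)."""
    return urlsplit(url).query
```

Under that assumption the expanded URL carries no `PEDANTIC=1` parameter at all, whereas the explicit https URL passed to `zypper ar` does.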

Actions #33

Updated by livdywan 3 days ago

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/233

Discussing this further with Andrii:

  • There's an inherent race condition between adding the repo and installing packages, unless we can guarantee that nothing is deleted while the test is running.
  • Actually not deleting packages in the repo would be preferable and more in line with how download.opensuse.org is designed to work.
Actions #34

Updated by livdywan 3 days ago

  • Status changed from Workable to Feedback
  • Actually not deleting packages in the repo would be preferable and more in line with how download.opensuse.org is designed to work.

This would mean re-visiting #167395 and considering alternatives.

Actions #35

Updated by livdywan 3 days ago

  • Related to action #167395: Ensure only the tested revision of devel:openQA packages are submitted to openSUSE:Factory size:M added
Actions #36

Updated by livdywan 3 days ago

livdywan wrote in #note-32:

livdywan wrote in #note-31:

Let's see if this works:

openqa-clone-job --repeat 1 --within-instance https://openqa.opensuse.org/tests/4952887 _GROUP=0 BUILD+=poo#178642 OPENQA_REPO_URL="obs://devel:openQA/openSUSE_Tumbleweed?PEDANTIC=1"

Apparently the ? gets encoded when used as part of an obs:// URL. I couldn't find any docs for this scheme, so I'm going for the verbose version now.

Not a blocker, but for reference I reported this issue upstream.

Actions #37

Updated by livdywan 2 days ago

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/233

Review on-going. Maybe we need to discuss in more detail next week.

Actions #38

Updated by livdywan 2 days ago

  • Due date changed from 2025-03-28 to 2025-04-04
Actions #39

Updated by livdywan 2 days ago

livdywan wrote in #note-37:

https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/233

Review on-going. Maybe we need to discuss in more detail next week.

Merged. Let's see how well this works 🤞🏼
