action #130477

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #108209: [epic] Reduce load on OSD

[O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M

Added by Julie_CAO 11 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2023-06-07
Due date:
% Done:

0%

Estimated time:

Description

Observation

The virt-install command sporadically failed to download kernel files from the O3 repo: --location http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT

The latest test hung while downloading initrd: https://openqa.opensuse.org/tests/3339451#step/unified_guest_installation/424
https://openqa.opensuse.org/tests/3324493#step/unified_guest_installation/424
1 test failed to check the location of initrd: https://openqa.opensuse.org/tests/3302510#step/unified_guest_installation/519
1 test failed to download linux: https://openqa.opensuse.org/tests/3302510#step/unified_guest_installation/519
1 test failed to download initrd: https://openqa.opensuse.org/tests/3307623#step/unified_guest_installation/1324

It appears that the http services on ariel do not work well at times. I don't know which web server is in use, nginx or apache2. I know little about web services and only found some suspicious error logs on ariel. Could you please investigate whether the http services are working well, and what causes our tests to fail when downloading kernel files from the O3 repo?

sudo journalctl -u apache2 -u nginx -l
Jun 06 12:06:44 ariel systemd[1]: Reloading The nginx HTTP and reverse proxy server...
Jun 06 12:06:44 ariel systemd[1]: Reloaded The nginx HTTP and reverse proxy server.
Jun 06 12:06:58 ariel systemd[1]: Stopping The nginx HTTP and reverse proxy server...
Jun 06 12:07:03 ariel systemd[1]: nginx.service: State 'stop-sigterm' timed out. Killing.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 29445 (nginx) with signal SIGKILL.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 29446 (nginx) with signal SIGKILL.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 5934 (nginx) with signal SIGKILL.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 5935 (nginx) with signal SIGKILL.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 5936 (nginx) with signal SIGKILL.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 2947 (nginx) with signal SIGKILL.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Main process exited, code=killed, status=9/KILL
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 5936 (nginx) with signal SIGKILL.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Killing process 2947 (nginx) with signal SIGKILL.
Jun 01 13:27:50 ariel nginx[11677]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Failed with result 'timeout'.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Unit process 29446 (nginx) remains running after unit stopped.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Unit process 29447 (nginx) remains running after unit stopped.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Unit process 29449 (nginx) remains running after unit stopped.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Unit process 5933 (nginx) remains running after unit stopped.
Jun 06 12:07:03 ariel systemd[1]: nginx.service: Unit process 2947 (nginx) remains running after unit stopped.
Jun 06 12:07:03 ariel systemd[1]: Stopped The nginx HTTP and reverse proxy server.
Jun 06 12:07:03 ariel systemd[1]: Starting The nginx HTTP and reverse proxy server...
Jun 06 12:07:03 ariel nginx[3063]: nginx: [warn] conflicting server name "openqa.opensuse.org" on 0.0.0.0:80, ignored
Jun 06 12:07:03 ariel nginx[3063]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Jun 06 12:07:03 ariel nginx[3063]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Jun 06 12:07:03 ariel systemd[1]: Started The nginx HTTP and reverse proxy server.
Jun 06 12:07:04 ariel nginx[3065]: nginx: [warn] conflicting server name "openqa.opensuse.org" on 0.0.0.0:80, ignored
Jun 06 12:08:39 ariel systemd[1]: Reloading The nginx HTTP and reverse proxy server...
Jun 06 12:08:39 ariel systemd[1]: Reloaded The nginx HTTP and reverse proxy server.
Jun 06 13:46:33 ariel systemd[1]: Starting The Apache Webserver...
Jun 06 13:46:34 ariel start_apache2[26924]: (98)Address already in use: AH00072: make_sock: could not bind to address 
Jun 06 13:46:34 ariel start_apache2[26924]: (98)Address already in use: AH00072: make_sock: could not bind to address 
Jun 06 13:46:34 ariel start_apache2[26924]: no listening sockets available, shutting down
Jun 06 13:46:34 ariel start_apache2[26924]: AH00015: Unable to open logs
Jun 06 13:46:34 ariel systemd[1]: apache2.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 13:46:34 ariel systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 06 13:46:34 ariel systemd[1]: Failed to start The Apache Webserver.
Jun 06 15:48:33 ariel systemd[1]: Starting The Apache Webserver...
Jun 06 15:48:33 ariel start_apache2[15235]: (98)Address already in use: AH00072: make_sock: could not bind to address ...
Jun 06 15:48:33 ariel systemd[1]: apache2.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 15:48:33 ariel systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 06 15:48:33 ariel systemd[1]: Failed to start The Apache Webserver.
Jun 06 16:29:44 ariel systemd[1]: Starting The Apache Webserver...
Jun 06 16:29:44 ariel start_apache2[14895]: (98)Address already in use: AH00072: make_sock: could not bind to address [>
Jun 06 16:29:44 ariel start_apache2[14895]: (98)Address already in use: AH00072: make_sock: could not bind to address 0>
Jun 06 16:29:44 ariel start_apache2[14895]: no listening sockets available, shutting down
Jun 06 16:29:44 ariel start_apache2[14895]: AH00015: Unable to open logs
Jun 06 16:29:45 ariel systemd[1]: apache2.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 16:29:45 ariel systemd[1]: apache2.service: Failed with result 'exit-code'.
Jun 06 16:29:45 ariel systemd[1]: Failed to start The Apache Webserver.
Jun 07 00:01:24 ariel systemd[1]: Reloading The nginx HTTP and reverse proxy server...
Jun 07 00:01:25 ariel systemd[1]: Reloaded The nginx HTTP and reverse proxy server.
Jun 07 00:02:25 ariel systemd[1]: Reloading The nginx HTTP and reverse proxy server...
Jun 07 00:02:25 ariel systemd[1]: Reloaded The nginx HTTP and reverse proxy server.

Acceptance criteria

  • AC1: No unexpected logs from multiple web servers e.g. not Apache and nginx at the same time
  • AC2: It is understood what routes are usable (http vs https)
  • AC3: openQA pulls in all necessary service dependencies on a web proxy but only really necessary ones
  • AC4: apache2 is not automatically restarted on o3 when nginx is already running

Suggestions

  • Maybe some script is restarting apache. unmask the service?
  • TLS is not provided by us (ha-proxy managed by heroes); but let's still clarify what is supposed to work?
  • Ask other admins on openSUSE channels about possibly running scripts that interfere and maybe start Apache2
  • Make apache2 and nginx exclusive or use a generic target in systemd or just drop it
  • Potentially mention in documentation that to prevent a conflict apache and nginx should not be installed together

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features (Resolved, kraih)

Copied to openQA Project - action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S (Resolved, dheidler)

Actions #1

Updated by okurz 11 months ago

  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #2

Updated by okurz 11 months ago

  • Related to action #129490: high response times on osd - Try nginx on o3 with enabled load limiting or load balancing features added
Actions #3

Updated by Julie_CAO 11 months ago

Hold on please, I just found a clue. The failure might come from the test side rather than the O3 server side. I'll look into it closely and update you.

Actions #4

Updated by okurz 11 months ago

  • Project changed from openQA Infrastructure to openQA Project
  • Category set to Support
  • Status changed from New to Feedback
  • Assignee set to okurz

OK, tracking the ticket as a support ticket then.

Actions #5

Updated by kraih 11 months ago

I noticed that those URLs start with http://, that will trigger a redirect to https://. Maybe that is not handled correctly in the test?

Actions #6

Updated by Julie_CAO 11 months ago

kraih wrote:

I noticed that those URLs start with http://, that will trigger a redirect to https://. Maybe that is not handled correctly in the test?

Yes, that's the thing. I initially tried my command with "http://download.opensuse.org" and it took 3 minutes, while with "http://O3" it took 23 minutes. Finally I switched to 'https://O3' and it took 3 minutes too.

I will switch to 'https' in my test. Thank you.

This is the virt-install command I used with "http://". The output shows it took 23 minutes in total.

~#: virt-install --name vm_o3 -v --location http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT --disk path=/var/lib/libvirt/images/vm_o3.qcow2,size=20,format=qcow2 --vcpus 1 --memory 4096 --extra-args "console=ttyS0,115200n8" --network=bridge=br0,model=virtio --vnc --extra-args "textmode=1" --autoconsole text --debug
...
[Thu, 08 Jun 2023 02:59:47 virt-install 38477] DEBUG (urldetect:303) Finding distro store for location=http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT
[Thu, 08 Jun 2023 03:04:09 virt-install 38477] DEBUG (urlfetcher:105) Fetching URI: http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/.treeinfo
[Thu, 08 Jun 2023 03:04:09 virt-install 38477] DEBUG (urldetect:74) treeinfo family=openSUSE Tumbleweed
...
[Thu, 08 Jun 2023 03:04:10 virt-install 38477] DEBUG (cli:266) 
Starting install...

Starting install...
[Thu, 08 Jun 2023 03:08:32 virt-install 38477] DEBUG (urlfetcher:174) hasFile(http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/boot/x86_64/loader/linux) returning True
[Thu, 08 Jun 2023 03:12:54 virt-install 38477] DEBUG (urlfetcher:174) hasFile(http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/boot/x86_64/loader/initrd) returning True
[Thu, 08 Jun 2023 03:17:16 virt-install 38477] DEBUG (urlfetcher:105) Fetching URI: http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/boot/x86_64/loader/linux
Retrieving 'linux'                                                                               |  10 MB  00:00:00 ... 
[Thu, 08 Jun 2023 03:17:16 virt-install 38477] DEBUG (urlfetcher:190) Saved file to /var/lib/libvirt/boot/virtinst-9x46yz3i-linux
[Thu, 08 Jun 2023 03:21:38 virt-install 38477] DEBUG (urlfetcher:105) Fetching URI: http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/boot/x86_64/loader/initrd
Retrieving 'initrd'                                                                              | 175 MB  00:00:02 ... 
[Thu, 08 Jun 2023 03:21:41 virt-install 38477] DEBUG (urlfetcher:190) Saved file to /var/lib/libvirt/boot/virtinst-rz39eczm-initrd
Actions #7

Updated by Julie_CAO 11 months ago

  • Status changed from Feedback to New

I have to reopen this because the https connection is not always reachable.

It may fail here: https://openqa.opensuse.org/tests/3346350#step/unified_guest_installation/421

[Fri, 09 Jun 2023 00:05:27 virt-install 3968] DEBUG (urldetect:303) Finding distro store for location=https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT
[Fri, 09 Jun 2023 00:05:27 virt-install 3968] DEBUG (osdict:138) Error creating libosinfo tree object for location=https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT : g-io-error-quark: Failed to load .treeinfo|treeinfo file: Could not connect to openqa.opensuse.org: Connection refused (39)
[Fri, 09 Jun 2023 00:05:27 virt-install 3968] DEBUG (urldetect:45) Failed to acquire file=.treeinfo: Couldn't acquire file https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/.treeinfo: HTTPSConnectionPool(host='openqa.opensuse.org', port=443): Max retries exceeded with url: /assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/.treeinfo (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f23f514b9a0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
[Fri, 09 Jun 2023 00:05:27 virt-install 3968] DEBUG (urldetect:45) Failed to acquire file=treeinfo: Couldn't acquire file https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/treeinfo: HTTPSConnectionPool(host='openqa.opensuse.org', port=443): Max retries exceeded with url: /assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/treeinfo (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f23f514b5e0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
[Fri, 09 Jun 2023 00:05:27 virt-install 3968] DEBUG (urldetect:45) Failed to acquire file=content: Couldn't acquire file https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/content: HTTPSConnectionPool(host='openqa.opensuse.org', port=443): Max retries exceeded with url: /assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/content (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f23f514b2b0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
[Fri, 09 Jun 2023 00:05:27 virt-install 3968] DEBUG (urldetect:45) Failed to acquire file=media.1/products: Couldn't acquire file https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/media.1/products: HTTPSConnectionPool(host='openqa.opensuse.org', port=443): Max retries exceeded with url: /assets/repo/openSUSE-Tumbleweed-oss-x86_64-CURRENT/media.1/products (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f23f514af80>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Or it may fail after that: https://openqa.opensuse.org/tests/3346047#step/unified_guest_installation/437

Is any configuration required on the test side for accessing O3 over https?

Actions #8

Updated by Julie_CAO 11 months ago

And I just accessed the O3 repo with virt-install over 'http' successfully and very fast. I am confused: why did 'http' sometimes work while 'https' did not, and vice versa?

Actions #9

Updated by okurz 11 months ago

Only http works for o3 workers within the o3 network; only https works for any connection from outside.

Actions #10

Updated by okurz 11 months ago

  • Status changed from New to Feedback

You can use the variable OPENQA_HOST to find the right o3 url
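
As a minimal sketch of that suggestion, a test could derive the repo location from OPENQA_HOST instead of hard-coding a scheme. This assumes OPENQA_HOST is available in the shell that runs virt-install and holds either a bare host name or a full base URL. The function name repo_location, the http prefix for bare host names, the https fallback when the variable is unset, and the host name o3.example are illustrative assumptions, not part of openQA.

```shell
#!/bin/sh
# Print the repo location for the repo name given as $1, based on OPENQA_HOST.
# Assumption: a bare host name means "inside the o3 network", where only http
# works; an unset variable means "outside", where only https works.
repo_location() {
    host="${OPENQA_HOST:-https://openqa.opensuse.org}"
    case "$host" in
        http://*|https://*) base="$host" ;;
        *) base="http://$host" ;;
    esac
    # Drop one trailing slash so the result never contains "//assets".
    printf '%s/assets/repo/%s\n' "${base%/}" "$1"
}

# Example with a placeholder host name (o3.example is not a real worker setting).
OPENQA_HOST=o3.example
repo_location openSUSE-Tumbleweed-oss-x86_64-CURRENT
```

The result could then be passed on as `virt-install --location "$(repo_location openSUSE-Tumbleweed-oss-x86_64-CURRENT)" ...`, so the same test works both inside and outside the o3 network.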

Actions #11

Updated by livdywan 11 months ago

  • Description updated (diff)
Actions #12

Updated by livdywan 11 months ago

  • Subject changed from [O3]http connection to O3 repo is broken sporadically in virtualization tests to [O3]http connection to O3 repo is broken sporadically in virtualization tests size:S
Actions #13

Updated by kraih 11 months ago

Jun 06 16:29:44 ariel start_apache2[14895]: (98)Address already in use: AH00072: make_sock: could not bind to address [>
Jun 06 16:29:44 ariel start_apache2[14895]: (98)Address already in use: AH00072: make_sock: could not bind to address 0>
Jun 06 16:29:44 ariel start_apache2[14895]: no listening sockets available, shutting down
Jun 06 16:29:44 ariel start_apache2[14895]: AH00015: Unable to open logs
Jun 06 16:29:45 ariel systemd[1]: apache2.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 16:29:45 ariel systemd[1]: apache2.service: Failed with result 'exit-code'.

I think this is actually caused by dependencies in the openQA systemd unit files. Every new deployment of openQA triggers a service restart, which then tries to restart apache2. Probably those just need to be updated from apache2 to nginx.
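
One way to check this theory is to look at the Wants= and Requires= properties of the openQA units. Below is a minimal sketch; the helper name pulls_in is hypothetical, and the sample input is made up for illustration, not real output from ariel.

```shell
#!/bin/sh
# Read `systemctl show -p Wants -p Requires UNIT` output on stdin and succeed
# if the unit named in $1 is listed in either property.
pulls_in() {
    target="$1"
    while IFS='=' read -r key value; do
        case "$key" in
            Wants|Requires)
                # The property value is a space-separated list of unit names.
                for unit in $value; do
                    [ "$unit" = "$target" ] && return 0
                done
                ;;
        esac
    done
    return 1
}

# Illustrative sample input only.
if printf 'Requires=system.slice\nWants=apache2.service\n' | pulls_in apache2.service; then
    echo "apache2.service is pulled in"
fi
```

On a real host the input would come from systemd itself, for example `systemctl show -p Wants -p Requires openqa-webui.service | pulls_in apache2.service`.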

Actions #14

Updated by okurz 11 months ago

  • Subject changed from [O3]http connection to O3 repo is broken sporadically in virtualization tests size:S to [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx
  • Category changed from Support to Regressions/Crashes
  • Status changed from Feedback to New
  • Assignee deleted (okurz)

kraih wrote:

I think this is actually caused by dependencies in the openQA systemd unit files. Probably those just need to be updated from apache2 to nginx.

Maybe that dependency can be generalized. If not, maybe it can be weakened, as systemd offers multiple dependency levels, I think. That is, we would not strictly require apache/nginx but would want something like apache or nginx to be there, and if it is not, that is fine as well :)

Removing the estimate, as we should likely reconsider and fix this properly.

Actions #15

Updated by okurz 11 months ago

  • Parent task set to #108209
Actions #16

Updated by okurz 11 months ago

  • Subject changed from [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx to [O3]http connection to O3 repo is broken sporadically in virtualization tests, likely due to systemd dependencies on apache/nginx size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #17

Updated by okurz 11 months ago

  • Copied to action #131024: Ensure both nginx+apache are properly covered in packages+testing+documentation size:S added
Actions #18

Updated by livdywan 11 months ago

  • Status changed from Workable to Feedback
  • Assignee set to livdywan

Julie_CAO wrote:

  • Make apache2 and nginx exclusive or use a generic target in systemd or just drop it
  • Only single-instance package "requires" apache2 in the spec, otherwise we "recommend"
  • All systemd services use Wants (livehandler, websockets, webui)

    sudo systemctl reset-failed apache2
    sudo systemctl mask apache2
    sudo systemctl restart openqa-webui

  1. Ensure Apache won't continue to show as "failed" (it doesn't seem like it should be doing anything currently, considering Jun 16 13:53:17 ariel start_apache2[11260]: (98)Address already in use: AH00072: make_sock: could not bind to address [::]:80 in the journal).
  2. Mask the service
  3. Apache is no longer pulled in as a dependency

So by testing it out I confirmed that "disabled" is not enough even for "Wants", but masking seems to work fine. Most likely we don't want to change anything in our packaging/services.
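
To double-check that the mask is in place, `systemctl is-enabled` prints `masked` for a masked unit. A small sketch follows; the function name is_masked is made up for illustration, and a runtime-only mask (reported as `masked-runtime`) is deliberately not counted.

```shell
#!/bin/sh
# Succeed if the unit given as $1 is permanently masked.
# `systemctl is-enabled` exits non-zero for masked units, so only the printed
# state is compared and the exit status is ignored.
is_masked() {
    [ "$(systemctl is-enabled "$1" 2>/dev/null)" = "masked" ]
}

if is_masked apache2.service; then
    echo "apache2.service is masked"
else
    echo "apache2.service is not masked"
fi
```
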

Actions #19

Updated by okurz 10 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

Somebody else should pick this up for now.

Actions #20

Updated by mkittler 10 months ago

Looks like the masking is still effective, so there's no urgent problem on o3 anymore. Wouldn't it make more sense to continue working on the "blocked" ticket #131024? When that ticket has been resolved we can supposedly remove the masking and consider this ticket resolved as well. Otherwise I wouldn't know what's left to do here.

Actions #21

Updated by okurz 10 months ago

Well, the idea was to adapt or generalize any systemd dependencies where possible and necessary first. That's why I was convinced that this ticket should be worked on first. But if you feel that is the wrong way around, feel welcome to pick up the other one independently. I will update #131024 accordingly.

Actions #22

Updated by mkittler 10 months ago

  • Assignee set to mkittler
Actions #23

Updated by mkittler 10 months ago

  • Status changed from Workable to In Progress

Ok, then I'll adapt/generalize systemd dependencies as part of this ticket first.

Actions #24

Updated by mkittler 10 months ago

  • Status changed from In Progress to Feedback

I haven't found a generic service one could depend on. So I guess the easiest solution is to simply remove the dependency at the systemd level (see https://github.com/os-autoinst/openQA/pull/5213). It is a breaking change for some setups (none of our production setups and likely not most setups in general) but I think we can live with that (better than over-engineering a more complicated solution).
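
For a setup that relied on the web UI pulling in its proxy, a local drop-in could bring a weak dependency back. This is only a sketch under assumptions: the unit name openqa-webui.service is taken from the restart command earlier in this ticket, while the function name proxy_dropin and the file name proxy.conf are hypothetical. Wants= is used because it does not fail the web UI when the proxy is missing.

```shell
#!/bin/sh
# Print a systemd drop-in that makes a unit weakly depend on the proxy
# service named in $1 (for example "nginx" or "apache2").
proxy_dropin() {
    printf '[Unit]\nWants=%s.service\n' "$1"
}

# Show the generated drop-in for nginx.
proxy_dropin nginx

# To install it (paths follow the usual systemd drop-in convention):
#   sudo mkdir -p /etc/systemd/system/openqa-webui.service.d
#   proxy_dropin nginx | sudo tee /etc/systemd/system/openqa-webui.service.d/proxy.conf
#   sudo systemctl daemon-reload
```
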

Actions #25

Updated by mkittler 10 months ago

  • Status changed from Feedback to Resolved

The PR has been merged and we haven't had to revert it yet. It will take a while until all our users have updated. It likely makes no sense to keep this ticket open for a month or so, so I'm resolving it right now and we can still re-open it later if necessary.
