action #81026

many jobs incomplete with auto_review:"(?s)Running on openqaworker-arm-2.*failed: 521 Connect timeout.*Result: setup failure":retry

Added by okurz 7 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
2020-12-14
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://openqa.suse.de/tests/5169445/file/autoinst-log.txt shows

[2020-12-14T04:53:56.0322 UTC] [info] [pid:44641] Downloading SLES-12-SP5-aarch64-GM-gnome.qcow2, request #131 sent to Cache Service
[2020-12-14T04:59:08.0750 UTC] [info] [pid:44641] Download of SLES-12-SP5-aarch64-GM-gnome.qcow2 processed:
[info] [#131] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50GiB
[info] [#131] Downloading "SLES-12-SP5-aarch64-GM-gnome.qcow2" from "http://openqa.suse.de/tests/5169445/asset/hdd/SLES-12-SP5-aarch64-GM-gnome.qcow2"
[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" failed: 521 Connect timeout
[info] [#131] Download error 521, waiting 5 seconds for next try (4 remaining)
[info] [#131] Downloading "SLES-12-SP5-aarch64-GM-gnome.qcow2" from "http://openqa.suse.de/tests/5169445/asset/hdd/SLES-12-SP5-aarch64-GM-gnome.qcow2"
[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" failed: 521 Connect timeout
[info] [#131] Download error 521, waiting 5 seconds for next try (3 remaining)
[info] [#131] Downloading "SLES-12-SP5-aarch64-GM-gnome.qcow2" from "http://openqa.suse.de/tests/5169445/asset/hdd/SLES-12-SP5-aarch64-GM-gnome.qcow2"
[info] [#131] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" is 2.3GiB, with ETag ""94770000-5b134d5e571db""
[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" successful, new cache size is 7.6GiB

[2020-12-14T04:59:08.0754 UTC] [error] [pid:44641] Failed to download SLES-12-SP5-aarch64-GM-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2
[2020-12-14T05:01:21.0824 UTC] [info] [pid:44641] +++ worker notes +++
[2020-12-14T05:01:21.0825 UTC] [info] [pid:44641] End time: 2020-12-14 05:01:21
[2020-12-14T05:01:21.0826 UTC] [info] [pid:44641] Result: setup failure
[2020-12-14T05:01:21.0860 UTC] [info] [pid:50249] Uploading autoinst-log.txt
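The retry pattern visible in the log (a fixed 5-second wait, a decreasing "remaining" counter, give up once the attempts are exhausted) can be sketched in shell. This is purely illustrative, not the actual cache service code; the stub download fails twice before succeeding, matching the three attempts in the log above:

```shell
# Stub standing in for the real asset download; fails twice, then succeeds.
failures_left=2
do_download() {
    if [ "$failures_left" -gt 0 ]; then
        failures_left=$((failures_left - 1))
        return 1    # simulates: failed: 521 Connect timeout
    fi
    return 0        # simulates a successful download
}

attempts=5
try=0
while [ "$try" -lt "$attempts" ]; do
    try=$((try + 1))
    if do_download; then
        echo "Download successful after $try tries"
        break
    fi
    remaining=$((attempts - try))
    if [ "$remaining" -eq 0 ]; then
        echo "Download failed for good"
        break
    fi
    echo "Download error 521, waiting 5 seconds for next try ($remaining remaining)"
    # sleep 5    # omitted here so the sketch runs instantly
done
```

With this stub the loop prints two "Download error 521 … (4 remaining)" / "(3 remaining)" lines and then succeeds on the third try, just like the #131 request in the log.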

On openqaworker-arm-2 I found that ping -4 openqa.suse.de works but ping -6 openqa.suse.de does not; ping -6 localhost does work.

I triggered a reboot of openqaworker-arm-2; I am not sure if this helps.

Problem

We should likely disable IPv6 again until we find a good solution.

Workaround

Retrigger and hope that the jobs end up on other machines. On the affected machine, disable IPv6, then reboot or restart the affected services.
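Disabling IPv6 on an affected machine is typically done persistently with a sysctl drop-in; a minimal sketch, with an illustrative file name (the actual workaround files on the workers may be named differently):

```
# /etc/sysctl.d/90-disable-ipv6.conf  -- illustrative path
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
```

Applied with sysctl --system; re-enabling means setting the values back to 0 (or removing the file) and re-applying, plus restarting the network.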


Related issues

Related to openQA Infrastructure - action #81198: [tracker-ticket] openqaworker-arm-{1..3} have network problems (cacheservice, OSD reachability). IPv6 disabled for now (New, 2020-12-18)

History

#1 Updated by nicksinger 7 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger

Unfortunately a reboot didn't help. But I realized that our "disable_ipv6" workarounds were still in place. I'm not 100% sure where they came from, but setting them back to 0 and restarting the network finally resolved the issue.
I will take this as a tracker ticket to rectify the following:

openqa:~ # salt -l error --no-color -C 'G@roles:worker' cmd.run 'sysctl -a | grep disable_ipv6 | grep -v tap | grep "\=\ 1"'
openqaworker2.suse.de:
    net.ipv6.conf.br2.disable_ipv6 = 1
    net.ipv6.conf.br3.disable_ipv6 = 1
    net.ipv6.conf.eth0.disable_ipv6 = 1
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker9.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
QA-Power8-4-kvm.qa.suse.de:
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.eth0.disable_ipv6 = 1
    net.ipv6.conf.eth1.disable_ipv6 = 1
    net.ipv6.conf.eth2.disable_ipv6 = 1
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
malbec.arch.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker8.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker6.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker5.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker13.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
grenache-1.qa.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker10.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
QA-Power8-5-kvm.qa.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker-arm-1.suse.de:
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.erspan0.disable_ipv6 = 1
    net.ipv6.conf.eth1.disable_ipv6 = 1
    net.ipv6.conf.eth2.disable_ipv6 = 1
    net.ipv6.conf.eth3.disable_ipv6 = 1
    net.ipv6.conf.eth4.disable_ipv6 = 1
    net.ipv6.conf.gre_sys.disable_ipv6 = 1
    net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker-arm-2.suse.de:
    net.ipv6.conf.ovs-system.disable_ipv6 = 1

#2 Updated by nicksinger 7 months ago

  • Priority changed from High to Normal

Okay, I ran the following command: sysctl -a | grep disable_ipv6 | cut -d= -f 1 | awk '{$1=$1;print}' | xargs -I {} sysctl {}=0 on arm-1, QA-Power8-4 and openqaworker2 and restarted the network over IPMI. I checked with salt -l error --no-color -C 'G@roles:worker' cmd.run 'cat /etc/sysctl.d/* | grep disable_ipv6 || true' whether there were still left-over workaround files - none were present. I validated with the following that all workers can reach OSD:

openqa:~ # salt -l error --no-color -C 'G@roles:worker' cmd.run 'ping -W 1 -c 1 -6 openqa.suse.de > /dev/null && echo "able to reach OSD over v6" || echo "IPv6 BROKEN"'
openqaworker2.suse.de:
    able to reach OSD over v6
QA-Power8-4-kvm.qa.suse.de:
    able to reach OSD over v6
openqaworker9.suse.de:
    able to reach OSD over v6
QA-Power8-5-kvm.qa.suse.de:
    able to reach OSD over v6
openqaworker5.suse.de:
    able to reach OSD over v6
openqaworker8.suse.de:
    able to reach OSD over v6
openqaworker6.suse.de:
    able to reach OSD over v6
grenache-1.qa.suse.de:
    able to reach OSD over v6
malbec.arch.suse.de:
    able to reach OSD over v6
openqaworker10.suse.de:
    able to reach OSD over v6
openqaworker13.suse.de:
    able to reach OSD over v6
openqaworker-arm-1.suse.de:
    able to reach OSD over v6
openqaworker-arm-2.suse.de:
    able to reach OSD over v6

I saw some interesting behavior on the arm workers: they were only able to ping OSD once the device entered "promiscuous mode", which I triggered by connecting Wireshark over SSH. Maybe this was pure coincidence, but I wanted to note it down somewhere at least…
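The text-munging part of the sysctl reset command used above (cut off the "= 1" value, then let awk trim the trailing whitespace cut leaves behind) can be checked in isolation against simulated sysctl output; the sample lines are illustrative, real input comes from sysctl -a:

```shell
# Simulated `sysctl -a | grep disable_ipv6` output:
sample='net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.eth0.disable_ipv6 = 1'

# Same extraction as in the reset command: keep only the key names.
# Each resulting key would then be fed to `sysctl {}=0` via xargs (needs root).
keys=$(printf '%s\n' "$sample" | cut -d= -f1 | awk '{$1=$1;print}')
printf '%s\n' "$keys"
```

The awk '{$1=$1;print}' step matters because cut -d= -f1 leaves a trailing space on each key, which would otherwise end up inside the argument passed to sysctl.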

#3 Updated by nicksinger 7 months ago

  • Status changed from In Progress to Feedback

Okay, I had to recover arm-2 once again with a power cycle. However, after the reboot it was still able to access OSD over v6, which I already see as an improvement. I will keep this in Feedback until I see at least one successful download in the openqa-worker-cacheservice-minion.service log.
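Whether a download succeeded can be grepped straight out of the cacheservice log; a sketch over the two outcome lines from the description (on a worker the input would come from e.g. journalctl -u openqa-worker-cacheservice-minion.service instead of the sample variable):

```shell
# Two representative log lines, copied from the ticket description:
sample='[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" failed: 521 Connect timeout
[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" successful, new cache size is 7.6GiB'

# Count successful downloads; a non-zero count is the signal waited for above.
result=$(printf '%s\n' "$sample" | grep -c 'Download of .* successful')
echo "$result"
```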

#4 Updated by nicksinger 6 months ago

  • Status changed from Feedback to Closed

Closing, as #81198 (https://progress.opensuse.org/issues/81198) covers all arm workers and the disabled IPv6 workaround.

#5 Updated by okurz 6 months ago

  • Related to action #81198: [tracker-ticket] openqaworker-arm-{1..3} have network problems (cacheservice, OSD reachability). IPv6 disabled for now added

#6 Updated by okurz 6 months ago

  • Status changed from Closed to Resolved
