action #81026
closed
many jobs incomplete with auto_review:"(?s)Running on openqaworker-arm-2.*failed: 521 Connect timeout.*Result: setup failure":retry
Added by okurz about 4 years ago.
Updated almost 4 years ago.
Description
Observation
https://openqa.suse.de/tests/5169445/file/autoinst-log.txt shows
[2020-12-14T04:53:56.0322 UTC] [info] [pid:44641] Downloading SLES-12-SP5-aarch64-GM-gnome.qcow2, request #131 sent to Cache Service
[2020-12-14T04:59:08.0750 UTC] [info] [pid:44641] Download of SLES-12-SP5-aarch64-GM-gnome.qcow2 processed:
[info] [#131] Cache size of "/var/lib/openqa/cache" is 0 Byte, with limit 50GiB
[info] [#131] Downloading "SLES-12-SP5-aarch64-GM-gnome.qcow2" from "http://openqa.suse.de/tests/5169445/asset/hdd/SLES-12-SP5-aarch64-GM-gnome.qcow2"
[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" failed: 521 Connect timeout
[info] [#131] Download error 521, waiting 5 seconds for next try (4 remaining)
[info] [#131] Downloading "SLES-12-SP5-aarch64-GM-gnome.qcow2" from "http://openqa.suse.de/tests/5169445/asset/hdd/SLES-12-SP5-aarch64-GM-gnome.qcow2"
[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" failed: 521 Connect timeout
[info] [#131] Download error 521, waiting 5 seconds for next try (3 remaining)
[info] [#131] Downloading "SLES-12-SP5-aarch64-GM-gnome.qcow2" from "http://openqa.suse.de/tests/5169445/asset/hdd/SLES-12-SP5-aarch64-GM-gnome.qcow2"
[info] [#131] Size of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" is 2.3GiB, with ETag ""94770000-5b134d5e571db""
[info] [#131] Download of "/var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2" successful, new cache size is 7.6GiB
[2020-12-14T04:59:08.0754 UTC] [error] [pid:44641] Failed to download SLES-12-SP5-aarch64-GM-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-12-SP5-aarch64-GM-gnome.qcow2
[2020-12-14T05:01:21.0824 UTC] [info] [pid:44641] +++ worker notes +++
[2020-12-14T05:01:21.0825 UTC] [info] [pid:44641] End time: 2020-12-14 05:01:21
[2020-12-14T05:01:21.0826 UTC] [info] [pid:44641] Result: setup failure
[2020-12-14T05:01:21.0860 UTC] [info] [pid:50249] Uploading autoinst-log.txt
Found on openqaworker-arm-2 that ping -4 openqa.suse.de works, but ping -6 openqa.suse.de does not; ping -6 localhost does work.
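For reference, a minimal reachability check along those lines could look like this (the -c/-W flags and the echo messages are illustrative, mirroring the salt check used further below):
ping -4 -c 1 -W 1 openqa.suse.de && echo "v4 ok" || echo "v4 BROKEN"
ping -6 -c 1 -W 1 openqa.suse.de && echo "v6 ok" || echo "v6 BROKEN"
ping -6 -c 1 -W 1 localhost && echo "v6 loopback ok" || echo "v6 loopback BROKEN"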
I triggered a reboot of openqaworker-arm-2, not sure if this helps.
Problem
We should likely disable IPv6 again until we find a good solution.
Workaround
Retrigger and hope that jobs end up on other machines. On the affected machine, disable IPv6, reboot, or restart the relevant services.
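A minimal sketch of the "disable IPv6 on the affected machine" step (the drop-in file name is hypothetical; whatever mechanism the earlier workaround used is not documented in this ticket):
# runtime, takes effect immediately
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
# persist across reboots via a sysctl.d drop-in (hypothetical file name)
printf 'net.ipv6.conf.all.disable_ipv6 = 1\nnet.ipv6.conf.default.disable_ipv6 = 1\n' > /etc/sysctl.d/90-disable-ipv6-workaround.conf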
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Unfortunately a reboot didn't help. But I realized that our "disable_ipv6" workaround settings were still in place. I'm not 100% sure where they came from, but setting them back to 0 and restarting the network finally resolved the issue.
I will take this as a tracker ticket to rectify the following:
openqa:~ # salt -l error --no-color -C 'G@roles:worker' cmd.run 'sysctl -a | grep disable_ipv6 | grep -v tap | grep "\=\ 1"'
openqaworker2.suse.de:
net.ipv6.conf.br2.disable_ipv6 = 1
net.ipv6.conf.br3.disable_ipv6 = 1
net.ipv6.conf.eth0.disable_ipv6 = 1
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker9.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
QA-Power8-4-kvm.qa.suse.de:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.eth0.disable_ipv6 = 1
net.ipv6.conf.eth1.disable_ipv6 = 1
net.ipv6.conf.eth2.disable_ipv6 = 1
net.ipv6.conf.ovs-system.disable_ipv6 = 1
malbec.arch.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker8.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker6.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker5.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker13.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
grenache-1.qa.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker10.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
QA-Power8-5-kvm.qa.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker-arm-1.suse.de:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.erspan0.disable_ipv6 = 1
net.ipv6.conf.eth1.disable_ipv6 = 1
net.ipv6.conf.eth2.disable_ipv6 = 1
net.ipv6.conf.eth3.disable_ipv6 = 1
net.ipv6.conf.eth4.disable_ipv6 = 1
net.ipv6.conf.gre_sys.disable_ipv6 = 1
net.ipv6.conf.ovs-system.disable_ipv6 = 1
openqaworker-arm-2.suse.de:
net.ipv6.conf.ovs-system.disable_ipv6 = 1
- Priority changed from High to Normal
Okay, I ran the following command on arm1, QA-Power8-4 and openqaworker2 now and restarted the network over IPMI:
sysctl -a | grep disable_ipv6 | cut -d= -f 1 | awk '{$1=$1;print}' | xargs -I {} sysctl {}=0
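Broken out with comments, that pipeline does the following (same behavior, just annotated for readability):
sysctl -a |                  # list all kernel parameters as "key = value"
  grep disable_ipv6 |        # keep only the IPv6 disable switches
  cut -d= -f 1 |             # drop the value, keep the parameter name (plus a trailing space)
  awk '{$1=$1;print}' |      # trim the surrounding whitespace
  xargs -I {} sysctl {}=0    # set every matching parameter back to 0, i.e. re-enable IPv6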
I checked with salt -l error --no-color -C 'G@roles:worker' cmd.run 'cat /etc/sysctl.d/* | grep disable_ipv6 || true' whether there are still left-over workaround files; none were present. I validated with the following that all workers can reach OSD:
openqa:~ # salt -l error --no-color -C 'G@roles:worker' cmd.run 'ping -W 1 -c 1 -6 openqa.suse.de > /dev/null && echo "able to reach OSD over v6" || echo "IPv6 BROKEN"'
openqaworker2.suse.de:
able to reach OSD over v6
QA-Power8-4-kvm.qa.suse.de:
able to reach OSD over v6
openqaworker9.suse.de:
able to reach OSD over v6
QA-Power8-5-kvm.qa.suse.de:
able to reach OSD over v6
openqaworker5.suse.de:
able to reach OSD over v6
openqaworker8.suse.de:
able to reach OSD over v6
openqaworker6.suse.de:
able to reach OSD over v6
grenache-1.qa.suse.de:
able to reach OSD over v6
malbec.arch.suse.de:
able to reach OSD over v6
openqaworker10.suse.de:
able to reach OSD over v6
openqaworker13.suse.de:
able to reach OSD over v6
openqaworker-arm-1.suse.de:
able to reach OSD over v6
openqaworker-arm-2.suse.de:
able to reach OSD over v6
I saw some interesting behavior on the arm workers: they were only able to ping OSD once the device entered "promiscuous mode", which I triggered by connecting Wireshark over SSH. Maybe this was just pure coincidence, but I wanted to note this down somewhere at least…
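In case somebody wants to check that observation again without attaching Wireshark, promiscuous mode can also be toggled directly (eth0 is a placeholder for the actual uplink interface):
ip link set eth0 promisc on           # put the NIC into promiscuous mode
ip link show eth0 | grep -o PROMISC   # verify the flag is set
ip link set eth0 promisc off          # revert afterwards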
- Status changed from In Progress to Feedback
Okay, I had to recover arm-2 once again with a power cycle. However, after the reboot it was still able to reach OSD over v6, which I already see as an improvement. I will keep this in Feedback until I see at least one successful download in the openqa-worker-cacheservice-minion.service log.
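A sketch of how to watch for that on the worker (assuming the unit logs to the systemd journal as usual):
journalctl -f -u openqa-worker-cacheservice-minion.service | grep -i download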
- Status changed from Feedback to Closed
- Related to action #81198: [tracker-ticket] openqaworker-arm-{1..3} have network problems (cacheservice, OSD reachability). IPv6 disabled for now added
- Status changed from Closed to Resolved