coordination #122650
closedQA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA (public) - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones
[epic] Fix firewall block and improve error reporting when test fails in curl log upload
100%
Description
Observation¶
openQA test in scenario sle-15-SP5-Online-s390x-xfstests_xfs-generic@s390x-kvm-sle15 fails in
generate_report
All xfstests runs in sle-15-SP5 s390x fails on that issue.
In this specific case the connection attempt with failed curl was from (reading out from vars.json)
"SUT_IP" : "s390kvm082.suse.de",
"VIRSH_GUEST" : "10.161.145.82",
"VIRSH_HOSTNAME" : "s390zp18.suse.de",
At first, I thought this is the same issue under debugging in #120261, but after that solution(https://github.com/os-autoinst/openQA/pull/4935/files) merged our fails in s390x still. By looking into the details I don't know why these tests still use worker2.oqa.suse.de as the download IP. Previous last good used IP address not use FQDN. May need some help by the tools team.
okurz ran time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
which reproduces the problem quite explicitly:
# time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:02:10 --:--:-- 0curl: (7) Failed to connect to worker2.oqa.suse.de port 20343: Connection timed out
real 2m11.316s
so very likely the firewall for the .oqa.suse.de zone just drops packets from 10.161.0.0
Reproducible¶
Fails since (at least) Build 40.1
Expected result¶
Last good: build38.1 http://openqa.suse.de/tests/9886322#step/generate_report/2
Suggestions¶
- Ask SUSE-IT network admins to REJECT packets instead of DROP so that we get more clear results #122653
- Ask SUSE-IT network admins to not block this traffic which we need for tests #122656
- As it looks like default connect timeout for curl resolves to 2m10s (see above) so that is above our default timeouts for script_run, etc., so find a combination where curl has a chance to provide a proper error earlier. Consider using
upload_logs
in this specific example but this does not completely help.upload_logs
uses a default timeout of 90s which is higher than the default forscript_run
of 30s which is still below the default for curl accounting to 2m10s. Maybe we add the parameter--connect-timeout 20
to curl or bump the timeout for upload_logs #122659 - Ensure the original problem is fixed #122539
Further details¶
Link to latest