coordination #122650 (closed)
QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
QA (public) - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones
[epic] Fix firewall block and improve error reporting when test fails in curl log upload
100%
Description
Observation
The openQA test in scenario sle-15-SP5-Online-s390x-xfstests_xfs-generic@s390x-kvm-sle15 fails in generate_report.
All xfstests runs on sle-15-SP5 s390x fail with this issue.
In this specific case the failed curl connection attempt came from the following machine (values read from vars.json):
"SUT_IP" : "s390kvm082.suse.de",
"VIRSH_GUEST" : "10.161.145.82",
"VIRSH_HOSTNAME" : "s390zp18.suse.de",
At first I thought this was the same issue being debugged in #120261, but after that fix (https://github.com/os-autoinst/openQA/pull/4935/files) was merged our s390x tests still fail. Looking into the details, I don't know why these tests still use worker2.oqa.suse.de as the download host. The last good run used an IP address rather than an FQDN. This may need some help from the tools team.
okurz ran time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
which reproduces the problem quite explicitly:
# time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--     0
curl: (7) Failed to connect to worker2.oqa.suse.de port 20343: Connection timed out
real 2m11.316s
so very likely the firewall for the .oqa.suse.de zone just drops packets from 10.161.0.0
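To shorten the reproduction and make the failure mode easier to read, the probe could bound the connect phase and map curl's exit code to a rough diagnosis. This is only a sketch: the helper names classify_curl_exit and probe_url are hypothetical, and the 20s default is an assumption. It relies on curl's documented exit codes (6 could not resolve host, 7 failed to connect, 28 operation timeout). With --connect-timeout set, a silently dropped connection would be expected to end in 28 after the limit, whereas a rejected one should fail quickly with 7.

```shell
#!/bin/sh
# Map a curl exit code to a rough diagnosis (hypothetical helper).
classify_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "dns-failure" ;;     # could not resolve host
        7)  echo "connect-failed" ;;  # refused, unreachable or OS-level connect timeout
        28) echo "timeout" ;;         # curl's own time limit was reached
        *)  echo "other" ;;
    esac
}

# Probe a URL with a bounded connect phase and print the diagnosis.
# The second argument is the connect timeout in seconds (default 20, an assumption).
probe_url() {
    url=$1
    limit=${2:-20}
    curl --silent --output /dev/null --connect-timeout "$limit" "$url"
    classify_curl_exit "$?"
}
```

Example call from the affected network: probe_url http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log 20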
Reproducible
Fails since (at least) Build 40.1
Expected result
Last good: build38.1 http://openqa.suse.de/tests/9886322#step/generate_report/2
Suggestions
- Ask SUSE-IT network admins to REJECT packets instead of DROP so that we get clearer results #122653
- Ask SUSE-IT network admins to not block this traffic which we need for tests #122656
- The default connect timeout for curl appears to resolve to 2m10s (see above), which is above our default timeouts for script_run etc., so find a combination where curl has a chance to provide a proper error earlier. Consider using upload_logs in this specific example, although this does not completely help: upload_logs uses a default timeout of 90s, which is higher than the script_run default of 30s but still below the curl default of 2m10s. Maybe we add the parameter --connect-timeout 20 to curl or bump the timeout for upload_logs #122659
- Ensure the original problem is fixed #122539
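The timeout reasoning in the suggestions can be checked with a small comparison. This is a sketch with a hypothetical helper first_to_fire; the three constants are the values quoted in this ticket (2m10s = 130s for curl, 30s for script_run, 90s for upload_logs), not values read from any configuration.

```shell
#!/bin/sh
# Timeouts in seconds, as quoted in this ticket.
CURL_DEFAULT_CONNECT=130   # 2m10s observed in the reproduction
SCRIPT_RUN_DEFAULT=30
UPLOAD_LOGS_DEFAULT=90

# Print which side gives up first: "curl" means curl can still report
# its own error, "wrapper" means the calling test API times out before
# curl does (hypothetical helper).
first_to_fire() {
    curl_limit=$1
    wrapper_limit=$2
    if [ "$curl_limit" -lt "$wrapper_limit" ]; then
        echo "curl"
    else
        echo "wrapper"
    fi
}
```

With the defaults, first_to_fire "$CURL_DEFAULT_CONNECT" "$UPLOAD_LOGS_DEFAULT" prints "wrapper", so the curl error is never seen. With --connect-timeout 20, first_to_fire 20 "$SCRIPT_RUN_DEFAULT" prints "curl", which leaves curl room to report the failure itself.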
Further details
Link to latest
Updated by okurz almost 2 years ago
- Copied from action #122539: test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out size:M added
Updated by okurz almost 2 years ago
- Description updated (diff)
- Status changed from New to Blocked
blocked by subtasks
Updated by okurz almost 2 years ago
- Related to coordination #122665: [epic] Improved PowerVM testing added
Updated by okurz over 1 year ago
- Status changed from Blocked to New
- Assignee deleted (okurz)
- Target version changed from Ready to future
We cannot work on improving the error reporting right now, so moving this out of the backlog.
Updated by okurz about 1 year ago
- Status changed from New to Resolved
- Assignee set to okurz
- Target version changed from future to Ready
With NUE1 decommissioned all active systems are in new security zones and I guess machines that are brought (back) into production will also end up in new security zones. No specific work for improving error reporting here was done and I don't think we need to improve that further. We need to rely on SUSE-IT to monitor their firewall accordingly.