coordination #122650: [epic] Fix firewall block and improve error reporting when test fails in curl log upload - openQA Project (public) - openSUSE Project Management Tool

coordination #122650

closed

QA (public) - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA (public) - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

[epic] Fix firewall block and improve error reporting when test fails in curl log upload

Added by okurz over 2 years ago. Updated over 1 year ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Feature requests

Target version:

Ready

Start date:

2022-12-29

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Tags:

reactive work

Description

Observation¶

openQA test in scenario sle-15-SP5-Online-s390x-xfstests_xfs-generic@s390x-kvm-sle15 fails in generate_report.
All xfstests runs on sle-15-SP5 s390x fail with this issue.

In this specific case the failed curl connection attempt came from the following machine (values read from vars.json):
"SUT_IP" : "s390kvm082.suse.de",
"VIRSH_GUEST" : "10.161.145.82",
"VIRSH_HOSTNAME" : "s390zp18.suse.de",

At first, I thought this was the same issue being debugged in #120261, but after that fix (https://github.com/os-autoinst/openQA/pull/4935/files) was merged our s390x tests still fail. Looking into the details, I don't know why these tests still use worker2.oqa.suse.de as the download host. The last good run used an IP address rather than an FQDN. We may need some help from the tools team.

okurz ran time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log which reproduces the problem quite explicitly:

# time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--     0curl: (7) Failed to connect to worker2.oqa.suse.de port 20343: Connection timed out
real    2m11.316s

So very likely the firewall for the .oqa.suse.de zone just drops packets coming from 10.161.0.0.
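To tell a silent DROP apart from a REJECT or a closed port, the port can be probed with a short deadline: a dropped connection attempt runs into the deadline, while a rejected one fails almost immediately. The following is only a sketch under assumptions. The function names probe_port and classify_probe are hypothetical, and it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command being available on the probing machine.

```shell
#!/bin/sh
# Hypothetical helpers, not part of openQA or os-autoinst.

# Map the exit status of the probe to a human-readable verdict.
# 124 is the status coreutils `timeout` returns when the deadline expires.
classify_probe() {
    case "$1" in
        0)   echo "open" ;;
        124) echo "timeout: packets probably dropped" ;;
        *)   echo "failed fast: refused, rejected or unresolvable" ;;
    esac
}

# Try to open a TCP connection and give up after a deadline.
# $1 = host, $2 = port, $3 = deadline in seconds
probe_port() {
    timeout "$3" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
    classify_probe "$?"
}
```

Calling probe_port worker2.oqa.suse.de 20343 5 from the affected guest would then be expected to report the timeout case if the firewall really drops the packets. A fast failure can also mean a DNS problem, so the verdict is a hint rather than proof.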

Reproducible¶

Fails since (at least) Build 40.1

Expected result¶

Last good: Build 38.1 http://openqa.suse.de/tests/9886322#step/generate_report/2

Suggestions¶

Ask SUSE-IT network admins to REJECT packets instead of DROP so that we get clearer results #122653
Ask SUSE-IT network admins to not block this traffic which we need for tests #122656
The default connect timeout for curl appears to resolve to 2m10s (see above), which is above our default timeouts for script_run etc., so find a combination where curl has a chance to report a proper error earlier. Consider using upload_logs in this specific example, although this does not completely help: upload_logs uses a default timeout of 90s, which is higher than the script_run default of 30s but still below the 2m10s that curl needs to give up. Maybe add the parameter --connect-timeout 20 to curl or bump the timeout for upload_logs #122659
Ensure the original problem is fixed #122539
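The timeout suggestion above could look roughly like the following sketch. It bounds the connect phase with --connect-timeout 20, as proposed, and the whole transfer with --max-time so that curl fails before a surrounding 90s limit. The names fetch_log and explain_curl_exit are hypothetical and the 60s overall limit is an assumed value, not something taken from the test code. Note that when --connect-timeout expires, curl exits with code 28 rather than the code 7 seen in the log above.

```shell
#!/bin/sh
# Hypothetical helpers; the timeout values are illustrative assumptions.

# Translate selected curl exit codes into a short diagnosis.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        7)  echo "could not connect (refused, unreachable or OS-level timeout)" ;;
        28) echo "curl timeout reached (possible silent packet drop)" ;;
        *)  echo "curl failed with exit code $1" ;;
    esac
}

# Download a URL into the current directory within a bounded time.
# $1 = URL
# $2 = connect timeout in seconds (default 20, as suggested in the ticket)
# $3 = overall limit in seconds (default 60, an assumed value)
fetch_log() {
    curl --fail --silent --show-error \
         --connect-timeout "${2:-20}" --max-time "${3:-60}" \
         -O "$1"
    rc=$?
    # Report the diagnosis on stderr and pass curl's exit code on.
    explain_curl_exit "$rc" >&2
    return "$rc"
}
```

With these values a blocked connection would be reported after about 20s, which is below the 30s script_run default mentioned above, and the whole call stays below the 90s upload_logs default.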

Further details¶

Link to latest

Subtasks 5 (0 open — 5 closed)

Related issues 1 (1 open — 0 closed)

Updated by okurz over 2 years ago

Copied from action #122539: test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out size:M added

Updated by okurz over 2 years ago

Description updated (diff)
Status changed from New to Blocked

blocked by subtasks

Updated by okurz over 2 years ago

Related to coordination #122665: [epic] Improved PowerVM testing added

Updated by okurz over 2 years ago

Tags set to reactive work

Updated by okurz over 2 years ago

Parent task set to #116623

Updated by okurz almost 2 years ago

Status changed from Blocked to New
Assignee deleted (~~okurz~~)
Target version changed from Ready to future

We cannot work on improving the error reporting right now, so moving this out of the backlog.

Updated by okurz over 1 year ago

Status changed from New to Resolved
Assignee set to okurz
Target version changed from future to Ready

With NUE1 decommissioned, all active systems are in new security zones, and I guess machines that are brought (back) into production will also end up in new security zones. No specific work on improving error reporting was done here and I don't think we need to improve that further. We need to rely on SUSE-IT to monitor their firewall accordingly.
