coordination #122650

closed

QA - coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

QA - coordination #116623: [epic] Migration of SUSE Nbg based openQA+QA+QAM systems to new security zones

[epic] Fix firewall block and improve error reporting when test fails in curl log upload

Added by okurz over 1 year ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2022-12-29
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

openQA test in scenario sle-15-SP5-Online-s390x-xfstests_xfs-generic@s390x-kvm-sle15 fails in
generate_report
All xfstests runs on sle-15-SP5 s390x fail with this issue.

In this specific case the failing curl connection attempt came from the following machine (values read from vars.json):
"SUT_IP" : "s390kvm082.suse.de",
"VIRSH_GUEST" : "10.161.145.82",
"VIRSH_HOSTNAME" : "s390zp18.suse.de",

At first I thought this was the same issue being debugged in #120261, but after that fix (https://github.com/os-autoinst/openQA/pull/4935/files) was merged our tests on s390x still fail. Looking into the details, I don't know why these tests still use worker2.oqa.suse.de as the download host. The last good run used an IP address, not an FQDN. This may need some help from the tools team.

okurz ran time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log which reproduces the problem quite explicitly:

# time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--     0curl: (7) Failed to connect to worker2.oqa.suse.de port 20343: Connection timed out
real    2m11.316s

So the firewall for the .oqa.suse.de zone very likely just drops packets from 10.161.0.0 rather than rejecting them, which is why curl only fails after the full connect timeout.
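To illustrate why a DROP is harder to diagnose than a REJECT, here is a minimal sketch that maps the exit status of a curl connection probe to a rough diagnosis. The helper name classify_curl_exit and the message wording are hypothetical. Exit codes 7 ("failed to connect") and 28 ("operation timeout") are curl's documented codes. Note that without --connect-timeout a dropped connection can also end in exit code 7 once the operating system gives up, as in the output above.

```shell
# Hypothetical helper: interpret the exit status of a curl connection probe.
# Exit codes 7 and 28 are documented by curl; the message wording is ours.
classify_curl_exit() {
    case "$1" in
        0)  echo "connected" ;;
        7)  echo "connect failed (refused, e.g. firewall REJECT, or OS-level connect timeout)" ;;
        28) echo "timed out (packets possibly dropped)" ;;
        *)  echo "other curl error $1" ;;
    esac
}

# Usage sketch (needs network access, so it is commented out here):
# rc=0
# curl --connect-timeout 5 -s -o /dev/null http://worker2.oqa.suse.de:20343/ || rc=$?
# classify_curl_exit "$rc"
```

With a short --connect-timeout, a rejecting firewall should typically produce exit code 7 almost immediately, while a dropping firewall should produce exit code 28 once the limit expires.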

Reproducible

Fails since (at least) Build 40.1

Expected result

Last good: Build 38.1 http://openqa.suse.de/tests/9886322#step/generate_report/2

Suggestions

  1. Ask SUSE-IT network admins to REJECT packets instead of DROP so that we get more clear results #122653
  2. Ask SUSE-IT network admins to not block this traffic which we need for tests #122656
  3. The default connect timeout for curl appears to be about 2m10s (see above), which is above our default timeouts for script_run etc. Find a combination where curl has a chance to report a proper error earlier. Consider using upload_logs in this specific example, although this does not help completely: upload_logs has a default timeout of 90s, which is higher than the 30s default of script_run but still below curl's 2m10s. Maybe we add the parameter --connect-timeout 20 to curl or bump the timeout for upload_logs #122659
  4. Ensure the original problem is fixed #122539
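
Suggestion 3 could look roughly like the following sketch, which bounds curl's connect phase so that curl itself reports the failure before a surrounding 30s or 90s timeout fires. The function name fetch_log is hypothetical and not part of os-autoinst. The 20s default is the value floated above, and exit code 28 is curl's documented timeout code.

```shell
# Hypothetical helper, not part of os-autoinst: download a log file but bound
# the TCP connect phase so curl reports the failure before the caller's timeout.
fetch_log() {
    url=$1
    connect_timeout=${2:-20}   # seconds; 20 is the value floated in suggestion 3
    rc=0
    # --connect-timeout only limits the connection phase, not the transfer itself
    curl --connect-timeout "$connect_timeout" -sS -O "$url" || rc=$?
    if [ "$rc" -eq 28 ]; then
        echo "connect to $url timed out after ${connect_timeout}s" >&2
    fi
    return "$rc"
}
```

Called as fetch_log http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log, this should give up after about 20s instead of 2m10s when packets are dropped, leaving room within the 30s default of script_run.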

Further details

Link to latest


Subtasks 5 (0 open, 5 closed)

openQA Tests - action #122539: test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out size:M (Closed, 2022-12-29)
action #122608: exit code of shell command not received by script_run (Resolved, okurz, 2023-01-02)
openQA Infrastructure - action #122653: Ask SUSE-IT network admins to REJECT packets instead of DROP so that we get more clear results size:S (Rejected, okurz, 2023-01-03)
openQA Infrastructure - action #122656: Ask SUSE-IT network admins to *not* block this traffic which we need for tests regarding s390x within SUSE network size:M (Resolved, okurz, 2023-01-03)
action #122659: Improved error reporting in openQA tests when curl times out on connection attempts (Rejected, okurz, 2023-01-03)

Related issues 1 (1 open, 0 closed)

Related to openQA Project - coordination #122665: [epic] Improved PowerVM testing (New, 2023-01-26)

Actions #1

Updated by okurz over 1 year ago

  • Copied from action #122539: test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out size:M added
Actions #2

Updated by okurz over 1 year ago

  • Description updated (diff)
  • Status changed from New to Blocked

blocked by subtasks

Actions #3

Updated by okurz over 1 year ago

Actions #4

Updated by okurz over 1 year ago

  • Tags set to reactive work
Actions #5

Updated by okurz about 1 year ago

  • Parent task set to #116623
Actions #6

Updated by okurz 10 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Target version changed from Ready to future

We cannot work on improving the error reporting right now, so I am moving this out of the backlog.

Actions #7

Updated by okurz 5 months ago

  • Status changed from New to Resolved
  • Assignee set to okurz
  • Target version changed from future to Ready

With NUE1 decommissioned, all active systems are in new security zones, and I guess machines that are brought (back) into production will also end up in new security zones. No specific work on improving error reporting was done here, and I don't think we need to improve that further. We need to rely on SUSE-IT to monitor their firewall accordingly.
