action #122539
Closed. openQA Project (public) - coordination #122650: [epic] Fix firewall block and improve error reporting when test fails in curl log upload
test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out size:M
Added by yosun almost 2 years ago. Updated about 1 year ago.
Description
Observation
openQA test in scenario sle-15-SP5-Online-s390x-xfstests_xfs-generic@s390x-kvm-sle15 fails in
generate_report
All xfstests runs on sle-15-SP5 s390x fail with this issue.
In this specific case the failing curl connection attempt came from (read from vars.json):
"SUT_IP" : "s390kvm082.suse.de",
"VIRSH_GUEST" : "10.161.145.82",
"VIRSH_HOSTNAME" : "s390zp18.suse.de",
At first I thought this is the same issue under debugging in #120261, but after that solution (https://github.com/os-autoinst/openQA/pull/4935/files) was merged, our tests on s390x still fail. Looking into the details, I don't know why these tests still use worker2.oqa.suse.de as the download address; the previous last good run used an IP address, not an FQDN. We may need some help from the tools team.
okurz ran time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
which reproduces the problem quite explicitly:
# time curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:02:10 --:--:-- 0curl: (7) Failed to connect to worker2.oqa.suse.de port 20343: Connection timed out
real 2m11.316s
so very likely the firewall for the .oqa.suse.de zone just drops packets from the 10.161.0.0 range
Reproducible
Fails since (at least) Build 40.1
Expected result
Last good: build 38.1, http://openqa.suse.de/tests/9886322#step/generate_report/2
Further details
Link to latest
Updated by okurz almost 2 years ago
- Tracker changed from action to coordination
- Subject changed from test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out to [epic] test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out
- Description updated (diff)
- Assignee set to okurz
Updated by okurz almost 2 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Category changed from Bugs in existing tests to Regressions/Crashes
Updated by okurz almost 2 years ago
- Copied to coordination #122650: [epic] Fix firewall block and improve error reporting when test fails in curl log upload added
Updated by okurz almost 2 years ago
- Tracker changed from coordination to action
- Project changed from openQA Project (public) to openQA Tests (public)
- Subject changed from [epic] test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out to test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out
- Description updated (diff)
- Category changed from Regressions/Crashes to Infrastructure
- Status changed from New to In Progress
I created the epic #122650 to improve the overall situation and to capture some further ideas we had. With this I brought this ticket back to its original place and will treat it as a support ticket, trying to help the OP.
@yosun the main problem seems to be that a firewall maintained by SUSE-IT EngInfra blocks traffic between the s390x hosts in the IP range 10.161.144.0/20 and .oqa.suse.de. I asked in https://suse.slack.com/archives/C0488BZNA5S/p1672748720536449?thread_ts=1670498352.238729&cid=C0488BZNA5S to resolve this
Updated by okurz almost 2 years ago
- Due date set to 2023-01-17
- Status changed from In Progress to Feedback
- Parent task set to #122650
@yosun besides the problem that the firewall obviously shouldn't block the traffic, there are more things you should do to improve the test code:
The test code calls
script_run('curl -O ' . autoinst_url . "/files/status.log; cat status.log > $STATUS_LOG");
which is visible in the autoinst-log.txt as:
[2022-12-22T17:35:46.807946+01:00] [debug] [pid:9263] tests/xfstests/generate_report.pm:110 called testapi::script_run
[2022-12-22T17:35:46.808225+01:00] [debug] [pid:9263] <<< testapi::script_run(cmd="curl -O http://worker2.oqa.suse.de:20343/rfhqRYw7W_g045X2/files/status.log; cat status.log > /opt/status.log", timeout=30, quiet=undef, output="", die_on_timeout=1)
…
[2022-12-22T17:37:16.933372+01:00] [debug] [pid:9263] <<< testapi::wait_serial(expect_not_found=0, no_regex=0, buffer_size=undef, record_output=undef, quiet=undef, timeout=30, regexp=qr/oLnFf-\d+-/u)
[2022-12-22T17:37:16.934172+01:00] [debug] [pid:9264] <<< consoles::serial_screen::read_until(no_regex=0, regexp="(?^u:oLnFf-\\d+-)", cmd="backend_wait_serial", expect_not_found=0, json_cmd_token="xvEkOyoS", buffer_size=undef, quiet=undef, record_output=undef, pattern="(?^u:oLnFf-\\d+-)", timeout=30)
[2022-12-22T17:37:46.973650+01:00] [debug] [pid:9263] >>> testapi::wait_serial: (?^u:oLnFf-\d+-): fail
which in the end took 2m = 120s due to the combination of the default wait_serial timeout of 90s plus the 30s default timeout of script_run. But the internal connection attempt timeout of curl summed up to 2m10s = 130s when I tried it, so you end up with openQA test timeouts and no clear error indication from curl. I suggest to always use a higher timeout on the outer layers of calls than any internal timeouts, so in this specific case I suggest you call script_run('curl …') with a timeout higher than 130s. Or consider using the -m option of curl, e.g. -m20, to allow curl at most 20s for the complete operation.
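A minimal sketch of how such a call could look in the test module, assuming the existing $STATUS_LOG variable from generate_report.pm; the concrete values 20s and 300s are only illustrative, not the test's actual settings:
# Cap curl itself at 20s (-m 20) and keep the script_run timeout well above
# anything curl can spend internally, so a failure surfaces as a curl error
# instead of an openQA wait_serial timeout.
script_run('curl -m 20 -O ' . autoinst_url . "/files/status.log; cat status.log > $STATUS_LOG", timeout => 300);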
Updated by okurz almost 2 years ago
- Related to coordination #122665: [epic] Improved PowerVM testing added
Updated by yosun almost 2 years ago
Thank you for taking care of this issue, I'll set a higher timeout than 130s as you said.
Updated by livdywan almost 2 years ago
- Subject changed from test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out to test fails in curl log from openqa and connect with FQDN worker2.oqa.suse.de always fails by time out size:M
Updated by yosun almost 2 years ago
Hi Oli, I updated the timeout to 300s, but the verification run [1] still fails with a timeout. I am aware you are still working on it with Lazaros, so I'm just syncing the info with you. Thanks a lot!
[1] https://openqa.suse.de/tests/10275671#step/generate_report/4
Updated by livdywan almost 2 years ago
@okurz Is there any update in the meanwhile? As always, an update can also be a plan to look into it. Just checking since this is due tomorrow.
Updated by okurz almost 2 years ago
- Due date deleted (2023-01-17)
- Status changed from Feedback to Blocked
Now blocked by #122656
Updated by okurz almost 2 years ago
- Status changed from Blocked to In Progress
#122656 resolved. Retriggered xfstests with https://openqa.suse.de/tests/10380574 to check current state.
Updated by openqa_review almost 2 years ago
- Due date set to 2023-02-10
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz almost 2 years ago
I used the command
for i in root@s390zp18.suse.de qamaster.qa.suse.de; do echo $i && ssh $i "timeout 5 curl -sS http://worker2.oqa.suse.de:20999/foo"; done
and s390zp18 runs into a timeout while qamaster says "Connection refused", as I expected. So 1. it looks like communication between .qa.suse.de and .oqa.suse.de generally works, can we say that? 2. And then why does s390zp18 behave differently? Because of an older OS, SLE12SP5? What does that tell us about the original issue?
I tend to assign the ticket to [qe-core], asking them to ensure that we have a properly supported, upgraded OS as base, which would also offer tooling like mtr; otherwise I would not know how to efficiently continue the investigation.
Updated by yosun almost 2 years ago
Seeing that you are looking for an efficient way to reproduce this, I tried to reproduce the bug more quickly by adding this parameter to the test: XFSTESTS_RANGES=generic/255-257
It runs only 3 tests, containing both passing and failing ones, and finishes in about 40 minutes. I hoped to reproduce the issue but failed: the job passed the curl step. https://openqa.suse.de/tests/10393889#step/generate_report/2
Two guesses for the next step:
- The short log makes the curl finish quickly, and the previous failures were caused by the number of subtests pushing the log size above some limit that triggers the timeout?
- Maybe it's because today is Saturday and the network traffic on the openQA worker is low?
To verify guess #2 I triggered another full-size test. Let's see what happens. -> https://openqa.suse.de/t10394027
To verify guess #1 I may need to increase the number of test cases step by step.
Updated by yosun almost 2 years ago
Today I tried to run only the failing tests, and the test log could be downloaded successfully (https://openqa.suse.de/tests/10397076). So this issue is not caused by new test failure information inside the log.
I also tried to run only half of the full test script (https://openqa.suse.de/tests/10397124#step/generate_report/3); the test log is much shorter than before (the last good run), and the test run still times out in the curl download part. So it seems we still need more help debugging on the infra side.
Updated by okurz almost 2 years ago
That's good information. So it's for sure not a problem with a firewall blocking ports. The change of infrastructure might have an impact of course, in particular if the communication now goes over more hops and has lower performance. So if the problem is about downloading a file that is too big, do you think it would be possible to use wget instead?
Updated by okurz almost 2 years ago
- Due date changed from 2023-02-10 to 2023-03-10
- Status changed from In Progress to Feedback
- Priority changed from High to Normal
Updated by yosun almost 2 years ago
I tried to use wget instead of curl to download the log; it still times out with the default wget timeout (900s): https://openqa.suse.de/tests/10481457#step/generate_report/2
I'm giving it another try with wget timeout=3600, which is kind of ridiculous for downloading a log of less than 1 MB... let's see what happens: https://openqa.suse.de/t10482126
Updated by okurz almost 2 years ago
OK, thanks for trying. The latter failed as well with a timeout. Honestly I don't know what in particular the code
# Reload uploaded status log back to file
script_run('wget --timeout=3600 ' . autoinst_url . "/files/status.log; cat status.log > $STATUS_LOG");
tries to achieve or why that is done. Where does the file /files/status.log originally come from? And why do you want to download it from the worker host back into the SUT?
And can you help to find easier reproduction steps that run within seconds or minutes rather than 40 minutes or hours?
Updated by yosun almost 2 years ago
okurz wrote:
OK, thanks for trying. The latter failed as well with a timeout. Honestly I don't know what in particular the code
# Reload uploaded status log back to file
script_run('wget --timeout=3600 ' . autoinst_url . "/files/status.log; cat status.log > $STATUS_LOG");
tries to achieve or why that is done. Where does the file /files/status.log originally come from? And why do you want to download it from the worker host back into the SUT?
This file comes from earlier steps in run.pm: it uses the openQA API save_tmp_file to save the logs when 1. all tests are done, or 2. a crash/hang happens and the system reboots. Because a reboot removes that log in the new snapshot, it needs to be uploaded to openQA first; generate_report.pm then downloads this log to analyze the results.
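For illustration, a minimal sketch of that round trip, assuming a $status_content variable holding the collected log text and the $STATUS_LOG path from the snippets above (names are illustrative, not necessarily what run.pm actually uses):
# In run.pm: store the collected log on the worker before the reboot, so it
# survives the snapshot rollback; it becomes reachable via autoinst_url.
save_tmp_file('status.log', $status_content);
# In generate_report.pm: fetch it back onto the SUT for result analysis.
script_run('curl -m 20 -O ' . autoinst_url . "/files/status.log; cat status.log > $STATUS_LOG", timeout => 300);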
And can you help to find easier reproduction steps that run within seconds or minutes rather than 40 minutes or hours?
At the moment, the shortest run with which I can reproduce this issue is with XFSTESTS_RANGES=xfs/0-100, i.e. 100 tests instead of 0-999, but it still takes more than 1 hour.
I'll let you know if I have a new idea.
Updated by openqa_review almost 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: xfstests_xfs-generic
https://openqa.suse.de/tests/10562519#step/generate_report/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by okurz almost 2 years ago
- Due date deleted (2023-03-10)
- Status changed from Feedback to New
- Assignee deleted (okurz)
- Target version changed from Ready to future
yosun wrote:
And can you help to find easier reproduction steps that run within seconds or minutes rather than 40 minutes or hours?
At the moment, the shortest run with which I can reproduce this issue is with XFSTESTS_RANGES=xfs/0-100, i.e. 100 tests instead of 0-999, but it still takes more than 1 hour.
I'll let you know if I have a new idea.
Until we have an easy way to reproduce this we don't have further ideas on what to do. Removing from our backlog for now.
Updated by openqa_review over 1 year ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: xfstests_xfs-generic
https://openqa.suse.de/tests/10718602#step/generate_report/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by openqa_review over 1 year ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: xfstests_xfs-generic
https://openqa.suse.de/tests/10941023#step/generate_report/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 56 days if nothing changes in this ticket.
Updated by openqa_review over 1 year ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: xfstests_xfs-generic
https://openqa.suse.de/tests/11172627#step/generate_report/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 112 days if nothing changes in this ticket.
Updated by yosun about 1 year ago
- Status changed from New to Closed
I added a check before the curl call in the xfstests-related code to skip downloading from the server if the SUT already has a local copy of the log file, which should reduce this kind of failure. If nothing else is blocked by this we can close this ticket. Feel free to reopen if it is still needed.
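A minimal sketch of such a check, reusing the $STATUS_LOG path from the earlier snippets; the actual implementation in the xfstests code may differ:
# Only fetch the status log from the worker if the SUT has no usable local
# copy, e.g. because a reboot rolled back the snapshot that contained it.
if (script_run("test -s $STATUS_LOG") != 0) {
    script_run('curl -m 20 -O ' . autoinst_url . "/files/status.log; cat status.log > $STATUS_LOG", timeout => 300);
}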