action #108464
[diamond assurance][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
0%
Description
Observation¶
openQA test in scenario microos-Tumbleweed-DVD-x86_64-remote_ssh_target@64bit fails in
textinfo
curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo; echo 7EEdq-$?-
curl: (7) Couldn't connect to server
7EEdq-7-
This has been failing quite frequently in this test lately - 'textinfo' passes reliably in all other tests.
"data/textinfo" refers to https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/data/textinfo
which is retrieved by
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/textinfo.pm
which is doing
$self->select_serial_terminal; assert_script_run('curl -O ' . data_url('textinfo'));
Test suite description¶
Maintainer: jrivera Boot with ssh=1 parameter and wait for parallel job (remote_ssh_controller) to install the system.
Reproducible¶
Fails since (at least) Build 20220315
Expected result¶
Last good: 20220314 (or more recent)
Suggestions¶
- Tests should wait until the network within the SUT is ready (and an IPv4 address is assigned) before trying to download data from the command server, see #108464#note-8.
- Likely not relevant anymore
- Maybe it only affects multi-machine tests as https://openqa.opensuse.org/tests/2247542/ is a multi-machine test
- Use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger to label and retrigger same failures and by that be able to query the database for more matches
- Try to reproduce, e.g. with https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation on o3
- Improve investigation about the state of network within the post_fail_hook, e.g. if the HTTP request does not work, is the port reachable? Is the host pingable? The error message says "Couldn't connect to server" so looks like the host does not accept requests on the port at all (not "Connection refused") but later in the test we are able to upload over curl to the same command server.
- Try to get verbose output from curl on client side (maybe also verbose information from command server on os-autoinst side)
Further details¶
Always latest result in this scenario: latest
History
#1
Updated by szarate 4 months ago
- Project changed from openQA Tests to openQA Project
- Subject changed from test fails in textinfo to test fails in textinfo - Downloading from command server failed right after boot
- Category changed from Bugs in existing tests to Concrete Bugs
this looks like an sporadic issue, but I think it's worth investigating since assert_script_run('curl -O ' . data_url('textinfo'));
belongs in the backend and that's a critical thing :)
# Test died: command 'curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo' failed at /usr/lib/os-autoinst/testapi.pm line 953.
testapi::_handle_script_run_ret(7, "curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo", "fail_message", "", "timeout", 90, "quiet", undef) called at /usr/lib/os-autoinst/testapi.pm line 991
testapi::assert_script_run("curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo") called at microos/tests/console/textinfo.pm line 39
textinfo::run(textinfo=HASH(0x55c8af23c578)) called at /usr/lib/os-autoinst/basetest.pm line 360
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354
basetest::runtest(textinfo=HASH(0x55c8af23c578)) called at /usr/lib/os-autoinst/autotest.pm line 372
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 372
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 242
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 242
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 296
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790), CODE(0x55c8b0c86640)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 488
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790)) called at /usr/lib/os-autoinst/autotest.pm line 298
autotest::start_process() called at /usr/bin/isotovideo line 256
#3
Updated by mkittler 3 months ago
The command server has definitely been started before:
[2022-03-16T19:36:37.458842+01:00] [info] cmdsrv: daemon reachable under http://*:20063/CldQCbjudrGe1EfF/ [2022-03-16T19:36:37.461389+01:00] [info] Listening at "http://[::]:20063" Web application available at http://[::]:20063
Note that Listening at "http://[::]:20063"
means it is listening on the IPv4 address and IPv6 address on all interfaces. So that shouldn't be a problem.
It also looks like further attempts to use the command server work, e.g. https://openqa.opensuse.org/tests/2247542#step/textinfo/29 and https://openqa.opensuse.org/tests/2247542#step/journal_check/14.
So maybe a bug in the specific route …/data/textinfo
.
By the way, where is the test data file/directory textinfo
supposed to come from? It would be useful to have this data to be able to reproduce the issue within a unit test.
#6
Updated by mkittler 3 months ago
Apart from one more failure it looks like the issue isn't happening anymore.
Try to get verbose output from curl on client side
According to man curl
"7" means "Failed to connect to host." (and not "Could not resolve host. The given remote host could not be resolved." which would be 6 and also not "Weird server reply. The server sent data curl could not parse." which would be 8). Considering the list of errors in the man page the error codes are very fine grained so I don't think adding a verbose flag to the curl invocation will help much. We have to find out why curl is unable to reach the command server. Maybe the SUT's networking is configured wrongly at this point. (Since uploading works later I suppose it cannot be completely/permanently broken.)
maybe also verbose information from command server on os-autoinst side
Likely also not very helpful it the command server cannot even be reached.
#7
Updated by mkittler 3 months ago
In the good case it looks the same:
[2022-03-16T21:31:39.110676+01:00] [info] cmdsrv: daemon reachable under http://*:20023/TZR66YKczH0t7Osh/ [2022-03-16T21:31:39.112762+01:00] [info] Listening at "http://[::]:20023" Web application available at http://[::]:20023 … [2022-03-16T21:41:47.779928+01:00] [debug] tests/console/textinfo.pm:39 called testapi::assert_script_run [2022-03-16T21:41:47.780093+01:00] [debug] <<< testapi::assert_script_run(cmd="curl -O http://10.0.2.2:20023/TZR66YKczH0t7Osh/data/textinfo", timeout=90, quiet=undef, fail_message="")
Except that the download actually works: https://openqa.opensuse.org/tests/2247711#step/textinfo/14
#8
Updated by mkittler 3 months ago
- Project changed from openQA Project to openQA Tests
- Category deleted (
Concrete Bugs) - Assignee deleted (
mkittler)
In this test the system is setup by a remote target. I don't think this should make a difference. However, it doesn't look like there's any attempt made to wait until the network is ready on the machine. It just waits for the login prompt and than immediately switches to the serial console attempting the download:
$self->select_serial_terminal; assert_script_run('curl -O ' . data_url('textinfo'));
I find it a bit weird that there actually is a test module networking
to test whether networking works but it only runs after textinfo
and doesn't wait until the network is actually ready (with some timeout). It would make more sense to swap the order of those test modules and wait until the network is ready within networking
.
I'm actually pretty sure that in these failures the networking just isn't ready. That's because the serial log shows that only the IPv6 address is assigned in the first place and the IPv4 address (which we rely on here) is only assigned later:
[ 406.246840][ T1] reboot: Restarting system [ 15.567090][ T1] systemd[1]: Failed to transition into init label 'system_u:system_r:init_t:s0', ignoring. Welcome to openSUSE MicroOS (x86_64) - Kernel 5.16.14-1-default (ttyS0). SSH host key: SHA256:r2d0xtLtSNQOPt0VhMBodwl05cjT9Gcq1of80ZVLUHo (DSA) SSH host key: SHA256:Y031XUds/7W74LT9hY6oO9LAtIbv118o6tRPvXF486w (ECDSA) SSH host key: SHA256:HrjLbuj6HDWGvevGCSaG92q1MSTBlXk6AcWsvJQLExA (ED25519) SSH host key: SHA256:GxJZDfGFv71X20sDKPg36fuJRDyEZGJiBVDQxXG1eX8 (RSA) ens4: fe80::5054:ff:fe12:38 localhost login: Welcome to openSUSE MicroOS (x86_64) - Kernel 5.16.14-1-default (ttyS0). SSH host key: SHA256:r2d0xtLtSNQOPt0VhMBodwl05cjT9Gcq1of80ZVLUHo (DSA) SSH host key: SHA256:Y031XUds/7W74LT9hY6oO9LAtIbv118o6tRPvXF486w (ECDSA) SSH host key: SHA256:HrjLbuj6HDWGvevGCSaG92q1MSTBlXk6AcWsvJQLExA (ED25519) SSH host key: SHA256:GxJZDfGFv71X20sDKPg36fuJRDyEZGJiBVDQxXG1eX8 (RSA) ens4: 10.0.2.16 fe80::5054:ff:fe12:38 localhost login: YpuUt-0-
The log message without IPv4 address is also visible in screenshots after the failing curl
invocation (https://openqa.opensuse.org/tests/2246907#step/textinfo/16) and the log message with IPv4 address only appears shortly later (https://openqa.opensuse.org/tests/2246907#step/textinfo/18).
So I would conclude that tests should simply wait until the network is ready (including IPv4) before using it. Hence I'm moving the ticket back to openQA tests.
#10
Updated by jlausuch 3 months ago
To give some more input, this is how it looks in SLE Micro: https://openqa.suse.de/tests/8429489#step/textinfo/14
I'm keeping this into my radar to investigate when I have time.
If it's sporadic, please lower down the prio.
#11
Updated by okurz 3 months ago
- Subject changed from [sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M to [diamond assurance][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
- Category set to Bugs in existing tests
- Assignee set to acarvajal
mkittler you should be aware that removing the target version and not assigning the ticket to a team (with an according keyword in the subject) while keeping a "[" in the subject will likely lead to the ticket being overlooked for long, see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging . You gave good suggestions what could be done except that the "networking" test module is microos specific, the test module "textinfo" is more generic, so moving one after the other is not something that works in all cases.
acarvajal assigning to you as test module maintainer. Could you please update the ticket with the right keyword for "diamond assurance"?