Project

General

Profile

action #108464

[diamond assurance][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M

Added by dimstar 4 months ago. Updated 3 months ago.

Status:
Workable
Priority:
High
Assignee:
Category:
Bugs in existing tests
Target version:
-
Start date:
2022-03-16
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

openQA test in scenario microos-Tumbleweed-DVD-x86_64-remote_ssh_target@64bit fails in
textinfo

curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo; echo 7EEdq-$?-
curl: (7) Couldn't connect to server
7EEdq-7-

This has been failing quite frequently in this test lately - 'textinfo' passes reliably in all other tests.

"data/textinfo" refers to https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/data/textinfo
which is retrieved by
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/textinfo.pm
which is doing

            $self->select_serial_terminal;
            assert_script_run('curl -O ' . data_url('textinfo'));

Test suite description

Maintainer: jrivera Boot with ssh=1 parameter and wait for parallel job (remote_ssh_controller) to install the system.

Reproducible

Fails since (at least) Build 20220315

Expected result

Last good: 20220314 (or more recent)

Suggestions

Further details

Always latest result in this scenario: latest

History

#1 Updated by szarate 4 months ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from test fails in textinfo to test fails in textinfo - Downloading from command server failed right after boot
  • Category changed from Bugs in existing tests to Concrete Bugs

this looks like an sporadic issue, but I think it's worth investigating since assert_script_run('curl -O ' . data_url('textinfo')); belongs in the backend and that's a critical thing :)

# Test died: command 'curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo' failed at /usr/lib/os-autoinst/testapi.pm line 953.
    testapi::_handle_script_run_ret(7, "curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo", "fail_message", "", "timeout", 90, "quiet", undef) called at /usr/lib/os-autoinst/testapi.pm line 991
    testapi::assert_script_run("curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo") called at microos/tests/console/textinfo.pm line 39
    textinfo::run(textinfo=HASH(0x55c8af23c578)) called at /usr/lib/os-autoinst/basetest.pm line 360
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354
    basetest::runtest(textinfo=HASH(0x55c8af23c578)) called at /usr/lib/os-autoinst/autotest.pm line 372
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 372
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 242
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 242
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 296
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790), CODE(0x55c8b0c86640)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 488
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790)) called at /usr/lib/os-autoinst/autotest.pm line 298
    autotest::start_process() called at /usr/bin/isotovideo line 256

#2 Updated by okurz 4 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#3 Updated by mkittler 3 months ago

The command server has definitely been started before:

[2022-03-16T19:36:37.458842+01:00] [info] cmdsrv: daemon reachable under http://*:20063/CldQCbjudrGe1EfF/
[2022-03-16T19:36:37.461389+01:00] [info] Listening at "http://[::]:20063"
Web application available at http://[::]:20063

Note that Listening at "http://[::]:20063" means it is listening on the IPv4 address and IPv6 address on all interfaces. So that shouldn't be a problem.

It also looks like further attempts to use the command server work, e.g. https://openqa.opensuse.org/tests/2247542#step/textinfo/29 and https://openqa.opensuse.org/tests/2247542#step/journal_check/14.

So maybe a bug in the specific route …/data/textinfo.

By the way, where is the test data file/directory textinfo supposed to come from? It would be useful to have this data to be able to reproduce the issue within a unit test.

#4 Updated by okurz 3 months ago

  • Subject changed from test fails in textinfo - Downloading from command server failed right after boot to [sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#6 Updated by mkittler 3 months ago

Apart from one more failure it looks like the issue isn't happening anymore.


Try to get verbose output from curl on client side

According to man curl "7" means "Failed to connect to host." (and not "Could not resolve host. The given remote host could not be resolved." which would be 6 and also not "Weird server reply. The server sent data curl could not parse." which would be 8). Considering the list of errors in the man page the error codes are very fine grained so I don't think adding a verbose flag to the curl invocation will help much. We have to find out why curl is unable to reach the command server. Maybe the SUT's networking is configured wrongly at this point. (Since uploading works later I suppose it cannot be completely/permanently broken.)

maybe also verbose information from command server on os-autoinst side

Likely also not very helpful it the command server cannot even be reached.

#7 Updated by mkittler 3 months ago

In the good case it looks the same:

[2022-03-16T21:31:39.110676+01:00] [info] cmdsrv: daemon reachable under http://*:20023/TZR66YKczH0t7Osh/
[2022-03-16T21:31:39.112762+01:00] [info] Listening at "http://[::]:20023"
Web application available at http://[::]:20023
…
[2022-03-16T21:41:47.779928+01:00] [debug] tests/console/textinfo.pm:39 called testapi::assert_script_run
[2022-03-16T21:41:47.780093+01:00] [debug] <<< testapi::assert_script_run(cmd="curl -O http://10.0.2.2:20023/TZR66YKczH0t7Osh/data/textinfo", timeout=90, quiet=undef, fail_message="")

Except that the download actually works: https://openqa.opensuse.org/tests/2247711#step/textinfo/14

#8 Updated by mkittler 3 months ago

  • Project changed from openQA Project to openQA Tests
  • Category deleted (Concrete Bugs)
  • Assignee deleted (mkittler)

In this test the system is setup by a remote target. I don't think this should make a difference. However, it doesn't look like there's any attempt made to wait until the network is ready on the machine. It just waits for the login prompt and than immediately switches to the serial console attempting the download:

     $self->select_serial_terminal;
     assert_script_run('curl -O ' . data_url('textinfo'));

I find it a bit weird that there actually is a test module networking to test whether networking works but it only runs after textinfo and doesn't wait until the network is actually ready (with some timeout). It would make more sense to swap the order of those test modules and wait until the network is ready within networking.

I'm actually pretty sure that in these failures the networking just isn't ready. That's because the serial log shows that only the IPv6 address is assigned in the first place and the IPv4 address (which we rely on here) is only assigned later:

[  406.246840][    T1] reboot: Restarting system
[   15.567090][    T1] systemd[1]: Failed to transition into init label 'system_u:system_r:init_t:s0', ignoring.



Welcome to openSUSE MicroOS (x86_64) - Kernel 5.16.14-1-default (ttyS0).

SSH host key: SHA256:r2d0xtLtSNQOPt0VhMBodwl05cjT9Gcq1of80ZVLUHo (DSA)
SSH host key: SHA256:Y031XUds/7W74LT9hY6oO9LAtIbv118o6tRPvXF486w (ECDSA)
SSH host key: SHA256:HrjLbuj6HDWGvevGCSaG92q1MSTBlXk6AcWsvJQLExA (ED25519)
SSH host key: SHA256:GxJZDfGFv71X20sDKPg36fuJRDyEZGJiBVDQxXG1eX8 (RSA)
ens4:  fe80::5054:ff:fe12:38


localhost login: 


Welcome to openSUSE MicroOS (x86_64) - Kernel 5.16.14-1-default (ttyS0).

SSH host key: SHA256:r2d0xtLtSNQOPt0VhMBodwl05cjT9Gcq1of80ZVLUHo (DSA)
SSH host key: SHA256:Y031XUds/7W74LT9hY6oO9LAtIbv118o6tRPvXF486w (ECDSA)
SSH host key: SHA256:HrjLbuj6HDWGvevGCSaG92q1MSTBlXk6AcWsvJQLExA (ED25519)
SSH host key: SHA256:GxJZDfGFv71X20sDKPg36fuJRDyEZGJiBVDQxXG1eX8 (RSA)
ens4: 10.0.2.16 fe80::5054:ff:fe12:38


localhost login: YpuUt-0-

The log message without IPv4 address is also visible in screenshots after the failing curl invocation (https://openqa.opensuse.org/tests/2246907#step/textinfo/16) and the log message with IPv4 address only appears shortly later (https://openqa.opensuse.org/tests/2246907#step/textinfo/18).

So I would conclude that tests should simply wait until the network is ready (including IPv4) before using it. Hence I'm moving the ticket back to openQA tests.

#9 Updated by mkittler 3 months ago

  • Description updated (diff)
  • Target version deleted (Ready)

#10 Updated by jlausuch 3 months ago

To give some more input, this is how it looks in SLE Micro: https://openqa.suse.de/tests/8429489#step/textinfo/14
I'm keeping this into my radar to investigate when I have time.
If it's sporadic, please lower down the prio.

#11 Updated by okurz 3 months ago

  • Subject changed from [sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M to [diamond assurance][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
  • Category set to Bugs in existing tests
  • Assignee set to acarvajal

mkittler you should be aware that removing the target version and not assigning the ticket to a team (with an according keyword in the subject) while keeping a "[" in the subject will likely lead to the ticket being overlooked for long, see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging . You gave good suggestions what could be done except that the "networking" test module is microos specific, the test module "textinfo" is more generic, so moving one after the other is not something that works in all cases.

acarvajal assigning to you as test module maintainer. Could you please update the ticket with the right keyword for "diamond assurance"?

#12 Updated by mkittler 3 months ago

I've pinged the maintainer and a few other people in the chat. I haven't just left the ticket hanging.

Also available in: Atom PDF