action #108464

closed

[sporadic][y] test fails in textinfo - Downloading from command server failed right after boot size:M

Added by dimstar about 2 years ago. Updated 2 months ago.

Status: Rejected
Priority: Low
Assignee: -
Target version: -
Start date: 2022-03-16
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario microos-Tumbleweed-DVD-x86_64-remote_ssh_target@64bit fails in
textinfo

curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo; echo 7EEdq-$?-
curl: (7) Couldn't connect to server
7EEdq-7-
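
The `echo 7EEdq-$?-` suffix is what puts curl's exit status on the serial console: a random marker followed by the exit code, here 7. A minimal shell sketch of that pattern, assuming two hypothetical helpers `run_with_marker` and `parse_marker_status` (neither is part of os-autoinst; the names are made up for illustration):

```shell
#!/bin/sh
# Hypothetical helpers illustrating the "<marker>-<exit code>-" pattern.
# They are NOT part of os-autoinst.

# Run a command, then print "<marker>-<exit status>-" on its own line.
run_with_marker() {
    marker=$1
    shift
    status=0
    "$@" || status=$?
    echo "${marker}-${status}-"
}

# Read console output on stdin and print the exit status that follows
# the last occurrence of "<marker>-".
parse_marker_status() {
    marker=$1
    sed -n "s/^.*${marker}-\([0-9][0-9]*\)-.*$/\1/p" | tail -n 1
}

# Usage:
#   run_with_marker 7EEdq sh -c 'exit 7'    # prints "7EEdq-7-"
```

Because the marker is matched anywhere in a line, the same parsing would also work when the marker lands after other console text, as in `localhost login: YpuUt-0-`.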

This has been failing quite frequently in this test lately - 'textinfo' passes reliably in all other tests.

"data/textinfo" refers to https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/data/textinfo
which is retrieved by
https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/textinfo.pm
which is doing

            $self->select_serial_terminal;
            assert_script_run('curl -O ' . data_url('textinfo'));

Test suite description

Maintainer: jrivera Boot with ssh=1 parameter and wait for parallel job (remote_ssh_controller) to install the system.

Reproducible

Fails since (at least) Build 20220315

Expected result

Last good: 20220314 (or more recent)

Suggestions

Further details

Always latest result in this scenario: latest

Actions #1

Updated by szarate about 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from test fails in textinfo to test fails in textinfo - Downloading from command server failed right after boot
  • Category changed from Bugs in existing tests to Regressions/Crashes

This looks like a sporadic issue, but I think it's worth investigating since assert_script_run('curl -O ' . data_url('textinfo')); belongs in the backend and that's a critical thing :)

# Test died: command 'curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo' failed at /usr/lib/os-autoinst/testapi.pm line 953.
    testapi::_handle_script_run_ret(7, "curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo", "fail_message", "", "timeout", 90, "quiet", undef) called at /usr/lib/os-autoinst/testapi.pm line 991
    testapi::assert_script_run("curl -O http://10.0.2.2:20063/CldQCbjudrGe1EfF/data/textinfo") called at microos/tests/console/textinfo.pm line 39
    textinfo::run(textinfo=HASH(0x55c8af23c578)) called at /usr/lib/os-autoinst/basetest.pm line 360
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354
    basetest::runtest(textinfo=HASH(0x55c8af23c578)) called at /usr/lib/os-autoinst/autotest.pm line 372
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 372
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 242
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 242
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 296
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 326
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790), CODE(0x55c8b0c86640)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 488
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x55c8b0bc8790)) called at /usr/lib/os-autoinst/autotest.pm line 298
    autotest::start_process() called at /usr/bin/isotovideo line 256
Actions #2

Updated by okurz about 2 years ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #3

Updated by mkittler about 2 years ago

The command server has definitely been started before:

[2022-03-16T19:36:37.458842+01:00] [info] cmdsrv: daemon reachable under http://*:20063/CldQCbjudrGe1EfF/
[2022-03-16T19:36:37.461389+01:00] [info] Listening at "http://[::]:20063"
Web application available at http://[::]:20063

Note that Listening at "http://[::]:20063" means it is listening on the IPv4 address and IPv6 address on all interfaces. So that shouldn't be a problem.

It also looks like further attempts to use the command server work, e.g. https://openqa.opensuse.org/tests/2247542#step/textinfo/29 and https://openqa.opensuse.org/tests/2247542#step/journal_check/14.

So maybe a bug in the specific route …/data/textinfo.

By the way, where is the test data file/directory textinfo supposed to come from? It would be useful to have this data to be able to reproduce the issue within a unit test.

Actions #4

Updated by okurz about 2 years ago

  • Subject changed from test fails in textinfo - Downloading from command server failed right after boot to [sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #6

Updated by mkittler about 2 years ago

Apart from one more failure it looks like the issue isn't happening anymore.


Try to get verbose output from curl on client side

According to man curl "7" means "Failed to connect to host." (and not "Could not resolve host. The given remote host could not be resolved." which would be 6 and also not "Weird server reply. The server sent data curl could not parse." which would be 8). Considering the list of errors in the man page the error codes are very fine grained so I don't think adding a verbose flag to the curl invocation will help much. We have to find out why curl is unable to reach the command server. Maybe the SUT's networking is configured wrongly at this point. (Since uploading works later I suppose it cannot be completely/permanently broken.)
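
To make the distinction between those three exit codes explicit, here is a small sketch. `describe_curl_exit` is a hypothetical helper written for this comment, and the wording is the one quoted from the man page above:

```shell
#!/bin/sh
# Hypothetical helper: map the curl exit codes discussed above to their
# meaning as quoted from the man page. Other codes are not covered here.
describe_curl_exit() {
    case "$1" in
        6) echo "Could not resolve host" ;;
        7) echo "Failed to connect to host" ;;
        8) echo "Weird server reply" ;;
        *) echo "other (see man curl)" ;;
    esac
}

# Usage:
#   describe_curl_exit 7    # prints "Failed to connect to host"
```

In the failing job the code is 7, so name resolution and the server's reply are not the problem; the connection itself is never established.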

maybe also verbose information from command server on os-autoinst side

Likely also not very helpful if the command server cannot even be reached.

Actions #7

Updated by mkittler about 2 years ago

In the good case it looks the same:

[2022-03-16T21:31:39.110676+01:00] [info] cmdsrv: daemon reachable under http://*:20023/TZR66YKczH0t7Osh/
[2022-03-16T21:31:39.112762+01:00] [info] Listening at "http://[::]:20023"
Web application available at http://[::]:20023
…
[2022-03-16T21:41:47.779928+01:00] [debug] tests/console/textinfo.pm:39 called testapi::assert_script_run
[2022-03-16T21:41:47.780093+01:00] [debug] <<< testapi::assert_script_run(cmd="curl -O http://10.0.2.2:20023/TZR66YKczH0t7Osh/data/textinfo", timeout=90, quiet=undef, fail_message="")

Except that the download actually works: https://openqa.opensuse.org/tests/2247711#step/textinfo/14

Actions #8

Updated by mkittler about 2 years ago

  • Project changed from openQA Project to openQA Tests
  • Category deleted (Regressions/Crashes)
  • Assignee deleted (mkittler)

In this test the system is set up by a remote target. I don't think this should make a difference. However, it doesn't look like there's any attempt made to wait until the network is ready on the machine. It just waits for the login prompt and then immediately switches to the serial console attempting the download:

     $self->select_serial_terminal;
     assert_script_run('curl -O ' . data_url('textinfo'));

I find it a bit weird that there actually is a test module networking to test whether networking works but it only runs after textinfo and doesn't wait until the network is actually ready (with some timeout). It would make more sense to swap the order of those test modules and wait until the network is ready within networking.

I'm actually pretty sure that in these failures the networking just isn't ready. That's because the serial log shows that only the IPv6 address is assigned in the first place and the IPv4 address (which we rely on here) is only assigned later:

[  406.246840][    T1] reboot: Restarting system
[   15.567090][    T1] systemd[1]: Failed to transition into init label 'system_u:system_r:init_t:s0', ignoring.



Welcome to openSUSE MicroOS (x86_64) - Kernel 5.16.14-1-default (ttyS0).

SSH host key: SHA256:r2d0xtLtSNQOPt0VhMBodwl05cjT9Gcq1of80ZVLUHo (DSA)
SSH host key: SHA256:Y031XUds/7W74LT9hY6oO9LAtIbv118o6tRPvXF486w (ECDSA)
SSH host key: SHA256:HrjLbuj6HDWGvevGCSaG92q1MSTBlXk6AcWsvJQLExA (ED25519)
SSH host key: SHA256:GxJZDfGFv71X20sDKPg36fuJRDyEZGJiBVDQxXG1eX8 (RSA)
ens4:  fe80::5054:ff:fe12:38


localhost login: 


Welcome to openSUSE MicroOS (x86_64) - Kernel 5.16.14-1-default (ttyS0).

SSH host key: SHA256:r2d0xtLtSNQOPt0VhMBodwl05cjT9Gcq1of80ZVLUHo (DSA)
SSH host key: SHA256:Y031XUds/7W74LT9hY6oO9LAtIbv118o6tRPvXF486w (ECDSA)
SSH host key: SHA256:HrjLbuj6HDWGvevGCSaG92q1MSTBlXk6AcWsvJQLExA (ED25519)
SSH host key: SHA256:GxJZDfGFv71X20sDKPg36fuJRDyEZGJiBVDQxXG1eX8 (RSA)
ens4: 10.0.2.16 fe80::5054:ff:fe12:38


localhost login: YpuUt-0-

The log message without IPv4 address is also visible in screenshots after the failing curl invocation (https://openqa.opensuse.org/tests/2246907#step/textinfo/16) and the log message with IPv4 address only appears shortly later (https://openqa.opensuse.org/tests/2246907#step/textinfo/18).
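
The difference between the two banners can be checked mechanically. A minimal sketch, assuming a hypothetical helper `banner_has_ipv4` that receives the interface line of the login banner as its argument (this is an illustration, not existing test code):

```shell
#!/bin/sh
# Hypothetical helper: succeed if a login-banner line such as
#   "ens4: 10.0.2.16 fe80::5054:ff:fe12:38"
# contains at least one dotted-quad IPv4 address.
banner_has_ipv4() {
    printf '%s\n' "$1" |
        grep -Eq '(^|[[:space:]])([0-9]{1,3}\.){3}[0-9]{1,3}([[:space:]]|$)'
}

# Usage:
#   banner_has_ipv4 "ens4:  fe80::5054:ff:fe12:38"           # fails (IPv6 only)
#   banner_has_ipv4 "ens4: 10.0.2.16 fe80::5054:ff:fe12:38"  # succeeds
```

The pattern only looks at the shape of the address and does not validate the octet range, which is enough to tell the two banners above apart.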

So I would conclude that tests should simply wait until the network is ready (including IPv4) before using it. Hence I'm moving the ticket back to openQA tests.
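
What "wait until the network is ready" could look like on the SUT is sketched below. This is a sketch under assumptions, not the fix that was applied: `wait_until` and `has_global_ipv4` are hypothetical helper names, the probe assumes iproute2's `ip` is installed on the SUT, and the timeout values are arbitrary.

```shell
#!/bin/sh
# Hypothetical helpers; not part of os-autoinst or the test distribution.

# Poll a probe command until it succeeds or the timeout expires.
# usage: wait_until <timeout_seconds> <interval_seconds> <command...>
# Both durations are assumed to be whole seconds.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Assumed probe: succeeds once a global-scope IPv4 address is configured.
has_global_ipv4() {
    ip -4 -o addr show scope global 2>/dev/null | grep -q 'inet '
}

# Usage (the URL variable is a placeholder):
#   wait_until 60 1 has_global_ipv4 && curl -O "$DATA_URL"
```

Keeping the probe separate from the polling loop means the same loop could wait on a different readiness condition, for example a successful connection to the command server, without changing the timeout handling.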

Actions #9

Updated by mkittler about 2 years ago

  • Description updated (diff)
  • Target version deleted (Ready)
Actions #10

Updated by jlausuch about 2 years ago

To give some more input, this is how it looks in SLE Micro: https://openqa.suse.de/tests/8429489#step/textinfo/14
I'm keeping this on my radar to investigate when I have time.
If it's sporadic, please lower the prio.

Actions #11

Updated by okurz about 2 years ago

  • Subject changed from [sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M to [diamond assurance][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
  • Category set to Bugs in existing tests
  • Assignee set to acarvajal

@mkittler you should be aware that removing the target version and not assigning the ticket to a team (with an according keyword in the subject) while keeping a "[" in the subject will likely lead to the ticket being overlooked for long, see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging . You gave good suggestions what could be done except that the "networking" test module is microos specific, the test module "textinfo" is more generic, so moving one after the other is not something that works in all cases.

@acarvajal assigning to you as test module maintainer. Could you please update the ticket with the right keyword for "diamond assurance"?

Actions #12

Updated by mkittler about 2 years ago

I've pinged the maintainer and a few other people in the chat. I haven't just left the ticket hanging.

Actions #13

Updated by szarate almost 2 years ago

These tickets are not on high prio

Actions #14

Updated by szarate almost 2 years ago

  • Tags set to bulkupdate

These tickets are not on high prio

Actions #15

Updated by acarvajal almost 2 years ago

  • Subject changed from [diamond assurance][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M to [qe-workload-testing][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
Actions #16

Updated by szarate almost 2 years ago

  • Priority changed from High to Normal
Actions #17

Updated by acarvajal almost 2 years ago

  • Due date set to 2022-07-22

okurz wrote:

@mkittler you should be aware that removing the target version and not assigning the ticket to a team (with an according keyword in the subject) while keeping a "[" in the subject will likely lead to the ticket being overlooked for long, see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging . You gave good suggestions what could be done except that the "networking" test module is microos specific, the test module "textinfo" is more generic, so moving one after the other is not something that works in all cases.

I see in the latest results that this module has not been failing in the linked scenario for at least 3 months, and the last change on the module predates even that date (commit bc26238d4db8ccde9700f6c86c700d32108dd629 from Wed Oct 20 10:00:37 2021 +0200).

I would propose closing this ticket for the time being, and re-opening/creating a new ticket if the issue appears again. As it is, there is nothing to debug or fix FMPOV.

@acarvajal assigning to you as test module maintainer. Could you please update the ticket with the right keyword for "diamond assurance"?

I added the keyword earlier today, but I am thinking on removing it since this is not a module that was used in Diamond Assurance/Workload Testing. In fact, as I told @mkittler via chat, the only reason I appear as maintainer is that there was no maintainer for the module back in 2017 when I adapted it to work on SLES for SAP, and I added myself: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3655#discussion_r143339417

Unless any of you guys have any objection, I will reject & close the ticket next week.

Actions #18

Updated by acarvajal almost 2 years ago

  • Subject changed from [qe-workload-testing][sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M to [sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M
Actions #19

Updated by okurz almost 2 years ago

acarvajal wrote:

I added the keyword earlier today, but I am thinking on removing it since this is not a module that was used in Diamond Assurance/Workload Testing. In fact, as I told @mkittler via chat, the only reason I appear as maintainer is that there was no maintainer for the module back in 2017 when I adapted it to work on SLES for SAP, and I added myself: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3655#discussion_r143339417

Well, if you don't maintain test modules please make sure there is another better maintainer listed then so that we don't bother you again in the future with requests to you :)

Regarding the keyword "[qe-workload-testing]", may I add this to https://progress.opensuse.org/projects/suseqa/ then?

Unless any of you guys have any objection, I will reject & close the ticket next week.

Well, thank you for checking on the history of job results. I can confirm that it seems the situation has somewhat improved and the test has not failed for some weeks but the observation from mkittler https://progress.opensuse.org/issues/108464#note-8 still applies and I would like someone to do something about this to not waste the brain cycles that have been invested so far. As a last resort we would take over the ticket as QE Tools team again. Please look into the above mentioned suggestions though before assigning to us or me.

Actions #20

Updated by acarvajal almost 2 years ago

  • Due date deleted (2022-07-22)
  • Assignee changed from acarvajal to JERiveraMoya

okurz wrote:

acarvajal wrote:

I added the keyword earlier today, but I am thinking on removing it since this is not a module that was used in Diamond Assurance/Workload Testing. In fact, as I told @mkittler via chat, the only reason I appear as maintainer is that there was no maintainer for the module back in 2017 when I adapted it to work on SLES for SAP, and I added myself: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3655#discussion_r143339417

Well, if you don't maintain test modules please make sure there is another better maintainer listed then so that we don't bother you again in the future with requests to you :)

Was this really necessary? Is the smiley at the end added to "soften the blow", because you are aware that this was not necessary? Does it help in solving the issue at hand? Does it add context information? I am aware my comment above does not help in solving the issue, but at least it adds context.

I really want to know, as IMHO this was not necessary or called for. And it is not the first time I have read similar comments from the same source.

So to reply:

  • "if you don't maintain test modules" -> Who said I did not?
  • "so that we don't bother you again in the future with requests to you" -> Who said this was bothering me?

Just to be clear: I am not bothered by requests, and I can maintain this and other modules in the repo. I am however bothered by the passive-aggressiveness with which you reply to poo's comments any time anybody adds one to a ticket.

So let's cut to the chase, and tell me exactly what you wish me to do:

  • Submit a NOOP PR "fixing" a module that for all intents and purposes is working, and link it here to give the impression that the issue was fixed? After all, the ticket was assigned to the module maintainer, suggesting that there is an issue with the module itself, when apparently that is not the actual root cause of the issue as per the analysis in https://progress.opensuse.org/issues/108464#note-8.
  • Step into another squad's test and rearrange the order of the test modules as suggested by @mkittler even though that goes beyond the scope of a test module maintainer?
  • Submit a PR assigning you as the module maintainer?
  • Another suggestion?

Regarding the keyword "[qe-workload-testing]", may I add this to https://progress.opensuse.org/projects/suseqa/ then?

I guess the actual name is SAP Workload Testing: https://confluence.suse.com/display/qasle/QE+squads+-+structure

Unless any of you guys have any objection, I will reject & close the ticket next week.

Well, thank you for checking on the history of job results. I can confirm that it seems the situation has somewhat improved and the test has not failed for some weeks but the observation from mkittler https://progress.opensuse.org/issues/108464#note-8 still applies and I would like someone to do something about this to not waste the brain cycles that have been invested so far. As a last resort we would take over the ticket as QE Tools team again. Please look into the above mentioned suggestions though before assigning to us or me.

Objection noted.

Changing the order of the test modules in the test as suggested in https://progress.opensuse.org/issues/108464#note-8 goes beyond the role of module maintainer AFAIK. Assigning this to @JERiveraMoya and clearing the due date. Joaquin: I hope that's OK; let me know if I should instead assign this to someone else.

Actions #21

Updated by JERiveraMoya almost 2 years ago

Is there any run to check the error? My name is in the test suite because we were the first to introduce MM in TW, but this is microos-Tumbleweed, which runs this additional module that is failing.
Anyway, if it is reproducible again we can take a look. It might be some synchronization error, and in that case we can switch the order of the test modules. Adding YaST label.

Actions #22

Updated by JERiveraMoya almost 2 years ago

  • Tags changed from bulkupdate to bulkupdate, YaST
Actions #23

Updated by okurz over 1 year ago

  • Subject changed from [sporadic] test fails in textinfo - Downloading from command server failed right after boot size:M to [sporadic][y] test fails in textinfo - Downloading from command server failed right after boot size:M

acarvajal wrote:

okurz wrote:

acarvajal wrote:

I added the keyword earlier today, but I am thinking on removing it since this is not a module that was used in Diamond Assurance/Workload Testing. In fact, as I told @mkittler via chat, the only reason I appear as maintainer is that there was no maintainer for the module back in 2017 when I adapted it to work on SLES for SAP, and I added myself: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3655#discussion_r143339417

Well, if you don't maintain test modules please make sure there is another better maintainer listed then so that we don't bother you again in the future with requests to you :)

Was this really necessary? Is the smiley at the end added to "soften the blow", because you are aware that this was not necessary? Does it help in solving the issue at hand? Does it add context information? I am aware my comment above does not help in solving the issue, but at least it adds context.

I really want to know, as IMHO this was not necessary or called for. And it is not the first time I have read similar comments from the same source.

So to reply:

  • "if you don't maintain test modules" -> Who said I did not?

I am really sorry. By no means did I want to insult or annoy you. I simply thought that you might not want to continue the maintainership especially also because you removed the team keyword from the subject line.

  • "so that we don't bother you again in the future with requests to you" -> Who said this was bothering me?

sorry, seems I misread and misunderstood your earlier comments.

Just to be clear: I am not bothered by requests, and I can maintain this and other modules in the repo. I am however bothered by the passive-aggressiveness with which you reply to poo's comments any time anybody adds one to a ticket.

Sorry, that you feel this way. I will keep this in mind and try to improve but please let me assure you that I had no such intent.

So let's cut to the chase, and tell me exactly what you wish me to do:

  • Submit a NOOP PR "fixing" a module that for all intents and purposes is working, and link it here to give the impression that the issue was fixed? After all, the ticket was assigned to the module maintainer, suggesting that there is an issue with the module itself, when apparently that is not the actual root cause of the issue as per the analysis in https://progress.opensuse.org/issues/108464#note-8.

Well, maybe it would actually help if the module checks its prerequisites. But you are right that the actual fix preventing such a problem needs to happen elsewhere.

  • Step into another squad's test and rearrange the order of the test modules as suggested by @mkittler even though that goes beyond the scope of a test module maintainer?

Well, honestly I think anyone should be welcome to propose improvements, e.g. propose a code change or even directly a pull request and ask the people responsible for a certain area to give feedback. Sure, you don't have to do that as test module maintainer. But often it's easier to propose code changes yourself instead of trying to explain in English rather than code to someone else what would need to be done.

  • Submit a PR assigning you as the module maintainer?

I don't think I can be a good module maintainer here

  • Another suggestion?

You already picked a good choice with assigning to JERiveraMoya as representative of [y] so adding the according keyword.

Regarding the keyword "[qe-workload-testing]", may I add this to https://progress.opensuse.org/projects/suseqa/ then?

I guess the actual name is SAP Workload Testing: https://confluence.suse.com/display/qasle/QE+squads+-+structure

Now I am not sure what should be added on https://progress.opensuse.org/projects/suseqa/. Ticket queries there use the character range [a-z-] so "[qe-workload-testing]" or "[sap-workload-testing]" would work but maybe just "[sap-workload]" would suffice?

Unless any of you guys have any objection, I will reject & close the ticket next week.

Well, thank you for checking on the history of job results. I can confirm that it seems the situation has somewhat improved and the test has not failed for some weeks but the observation from mkittler https://progress.opensuse.org/issues/108464#note-8 still applies and I would like someone to do something about this to not waste the brain cycles that have been invested so far. As a last resort we would take over the ticket as QE Tools team again. Please look into the above mentioned suggestions though before assigning to us or me.

Objection noted.

Changing the order of the test modules in the test as suggested in https://progress.opensuse.org/issues/108464#note-8 goes beyond the role of module maintainer AFAIK. Assigning this to @JERiveraMoya and clearing the due date. Joaquin: I hope that's OK; let me know if I should instead assign this to someone else.

Actions #24

Updated by JERiveraMoya over 1 year ago

  • Tags changed from bulkupdate, YaST to bulkupdate
  • Project changed from openQA Tests to qe-yam
  • Category deleted (Bugs in existing tests)
  • Status changed from Workable to New
  • Priority changed from Normal to Low
  • Target version set to future
Actions #25

Updated by JERiveraMoya 2 months ago

  • Status changed from New to Rejected
  • Assignee deleted (JERiveraMoya)
  • Target version deleted (future)