coordination #65118
closed[epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons
Description
Observation¶
This issue on multimachine tests is happening every day, I checked briefly openvswitch and tap
devices, I think there is everything fine.
But there is websocked error https://openqa.suse.de/tests/4072150/file/autoinst-log.txt
[2020-04-01T06:24:30.424 UTC] [debug] terminating command server 4670 because test execution ended
[2020-04-01T06:24:30.424 UTC] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20083/S0zCHTSjnanXpNnc/broadcast
[2020-04-01T06:24:30.429 UTC] [debug] isotovideo: unable to inform websocket clients about stopping command server: Connection refused at /usr/bin/isotovideo line 175.
openQA test in scenario sle-15-SP1-Server-DVD-Updates-x86_64-qam-rsync-client@64bit fails in
rsync_client
Test suite description¶
Maintainer: klorenzova
Reproducible¶
Fails since (at least) Build 20200401-1
Expected result¶
Last good: 20200331-2 (or more recent)
Further details¶
Always latest result in this scenario: latest
Updated by openqa_review over 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: qam-vsftpd-client
https://openqa.suse.de/tests/4118883
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by okurz over 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: qam-salt-minion
https://openqa.suse.de/tests/4196266
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by okurz over 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: mru-install-multipath-remote
https://openqa.suse.de/tests/4256295
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by apappas over 4 years ago
- Related to action #65366: [qe-core][qam] [sporadic] test fails with `mydie at /usr/lib/os-autoinst/lockapi.pm line 41.` added
Updated by okurz over 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: mru-install-multipath-remote
https://openqa.suse.de/tests/4338220
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by okurz over 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: mru-install-multipath-remote
https://openqa.suse.de/tests/4388463
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by riafarov over 4 years ago
- Project changed from openQA Tests to openQA Project
- Subject changed from [tools] multimachine test fails due to websocket refusing connection to multimachine test fails due to websocket refusing connection
- Category deleted (
Bugs in existing tests)
Setting correct project
Updated by okurz over 4 years ago
- Category set to Regressions/Crashes
- Status changed from New to Rejected
- Assignee set to okurz
- Target version set to Ready
The mentioned job https://openqa.suse.de/tests/4388463 yields 404, https://openqa.suse.de/tests/4499394 is the latest in the original scenario and it's passed with a long history of jobs passed in https://openqa.suse.de/tests/4499394#next_previous . Unfeasible to fix if no reproduced issues are found.
Updated by dzedro over 4 years ago
- Status changed from Rejected to New
Don't reject issue which is happening every day! I can put tens of failed MM tests every day.
https://openqa.suse.de/tests/4499216
https://openqa.suse.de/tests/4499450
Updated by okurz over 4 years ago
- Status changed from New to In Progress
- Priority changed from Urgent to High
Please act like a professional. I stated that I could not find reproduced issues. And if this issue is persisting since 4 months you obviously have found a good workaround so I doubt the issue suddenly becomes "Urgent" just because you don't like my ticket update.
The jobs you linked are "parallel_failed" so they are not reflecting the source problem. I can see that in both cases the jobs are "failed" and were running on openqaworker3:14, e.g. for https://openqa.suse.de/tests/4499279 and https://openqa.suse.de/tests/4499451 . In both cases there is an error in the log like "Failed to run dbus command 'set_vlan' with arguments 'tap13 18' : 'tap13' is not connected to bridge 'br1'". The reason is that the worker instance 14 should not even be running according to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L270 . The error handling should be improved by https://github.com/os-autoinst/os-autoinst/pull/1477 for #66376 . The systemd service on the worker should not even be enabled but it is:
okurz@openqaworker3:~> sudo systemctl status openqa-worker@14
● openqa-worker@14.service - openQA Worker #14
Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker@.service.d
└─override.conf
Active: active (running) since Wed 2020-07-29 07:27:26 CEST; 5h 42min ago
why it has been started 5h ago I am not sure but it's the same problem as in #66376 . Did now sudo systemctl disable --now openqa-worker@{14..16}
Updated by dzedro over 4 years ago
It became urgent because it's every day issue since it has been created and you literally asked for it with your professionalism.
I don't think it's related to one worker, I have seen it on "all" workers.
https://openqa.suse.de/tests/4499071
https://openqa.suse.de/tests/4499158
Updated by okurz over 4 years ago
At least I think we can crosscheck for worker instances that are outside the correctly configured range, e.g. on each worker:
numworkers=$(sed -n "s/# numofworkers: //p" /etc/openqa/workers.ini)
for i in $(sudo systemctl --no-legend list-units -t service --state=active openqa-worker@\* | sed -n "s/^.*@\([0-9]\+\).*$/\1/p"); do test $i -gt $numworkers && echo "openqa-worker@$i active but higher than numworkers: $numworkers"; done
Executing that within a salt cmd.run
is tricky and will not be parsed correctly so I put that into https://w3.suse.de/~okurz/check_num_openqa_workers and call with sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'curl -s https://w3.suse.de/~okurz/check_num_openqa_workers | sh -'
which confirms that there are no more superfluous worker instances after stopping the ones on openqaworker3.
For the mentioned jobs from #65118#note-12 both jobs are "parallel_failed" so we have to look for the error on the parallel jobs. For the first the parallel one is https://openqa.suse.de/tests/4499072 which ran on openqaworker3:14 , i.e. the misconfigured one which shouldn't happen anymore until something/someone restarts wrong worker instances. Also the autoinst-log.txt in https://openqa.suse.de/tests/4499072/file/autoinst-log.txt shows the expected error message "Failed to run dbus command 'set_vlan' with arguments 'tap13 17' : 'tap13' is not connected to bridge 'br1'".
The other job stopped in https://openqa.suse.de/tests/4499165#step/smt_client1/10 with "# Test died: mydie at /usr/lib/os-autoinst/lockapi.pm line 41." and no dbus error. Details from autoinst-log.txt:
[2020-07-29T06:07:29.616 CEST] [debug] <<< testapi::record_info(title="Paused", output="Wait for smt_setup (on parent job)", result="ok")
[2020-07-29T06:07:29.691 CEST] [info] ::: lockapi::_try_lock: Retry 1 of 7...
[2020-07-29T06:07:39.758 CEST] [info] ::: lockapi::_try_lock: Retry 2 of 7...
[2020-07-29T06:07:49.828 CEST] [info] ::: lockapi::_try_lock: Retry 3 of 7...
[2020-07-29T06:07:59.916 CEST] [info] ::: lockapi::_try_lock: Retry 4 of 7...
[2020-07-29T06:08:10.001 CEST] [info] ::: lockapi::_try_lock: Retry 5 of 7...
[2020-07-29T06:08:20.118 CEST] [info] ::: lockapi::_try_lock: Retry 6 of 7...
[2020-07-29T06:08:30.207 CEST] [info] ::: lockapi::_try_lock: Retry 7 of 7...
[2020-07-29T06:08:40.208 CEST] [debug] tests/smt/smt_client1.pm:32 called lockapi::barrier_wait
[2020-07-29T06:08:40.208 CEST] [debug] <<< bmwqemu::mydie(cause_of_death="barrier 'smt_setup': lock owner already finished")
[2020-07-29T06:08:40.297 CEST] [info] ::: basetest::runtest: # Test died: mydie at /usr/lib/os-autoinst/lockapi.pm line 41.
That I don't understand yet but for sure it is not the same problem as for the other job.
With the SQL query select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-07-20' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id;
I look for all jobs with the ticket id in comment. To exclude openqaworker3 which was ill-configured I am using the SQL query:
select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-07-20' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id and host != 'openqaworker3';
which shows me 40 jobs right now. The latest job failed in https://openqa.suse.de/tests/4508248#step/boot_to_desktop/5 in the scenario "sle-12-SP4-Server-DVD-Updates-x86_64-mru-install-multipath-remote@64bit" with details
[2020-07-30T21:28:41.047 CEST] [debug] <<< testapi::send_key(key="ret", do_wait=0, wait_screen_change=0)
[2020-07-30T21:28:41.315 CEST] [debug] tests/boot/boot_to_desktop.pm:42 called opensusebasetest::wait_boot -> lib/opensusebasetest.pm:1022 called opensusebasetest::wait_boot_past_bootloader -> lib/opensusebasetest.pm:920 called opensusebasetest::handle_displaymanager_login -> lib/opensusebasetest.pm:753 called testapi::assert_screen
[2020-07-30T21:28:41.316 CEST] [debug] <<< testapi::assert_screen(mustmatch=[
"displaymanager",
"emergency-shell",
"emergency-mode"
], timeout=300)
…
[0m[32m[2020-07-30T21:33:49.497 CEST] [debug] >>> testapi::_check_backend_response: match=displaymanager,emergency-mode,emergency-shell timed out after 300 (assert_screen)
[0m[33m[2020-07-30T21:33:49.539 CEST] [info] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'displaymanager, emergency-shell, emergency-mode' matched
while it looks like the system is still stuck in the PXE boot menu not acting on the "ret" that was pressed. I do not see the relation to the problem described in before at all.
@dzedro can you explain why you think that for example https://openqa.suse.de/tests/4508248# also fails due to problems of the network between the SUT VMs?
Updated by okurz over 4 years ago
- Status changed from In Progress to Feedback
@dzedro can you explain why you think that for example https://openqa.suse.de/tests/4508248# also fails due to problems of the network between the SUT VMs?
Updated by dzedro over 4 years ago
Because it's MM and the failure (websocket refusing connection) is always on parent, often supportserver or just the test to which other test is PARALLEL_WITH.
Updated by okurz over 4 years ago
- Related to action #50225: Make JOB_TIMEOUT incompletes more obvious added
Updated by okurz over 4 years ago
- Related to action #52997: [sle][functional][tools]test fails in sshd - timeout_exceeded (13:20 hours) added
Updated by okurz over 4 years ago
- Related to action #58379: isotovideo is slow to shutdown / error messages on proper shutdown added
Updated by okurz over 4 years ago
- Related to action #57707: isotovideo fails to terminate cleanly, message "isotovideo: unable to inform websocket clients about stopping command server: Request timeout", regression from 4cd4af2b added
Updated by okurz over 4 years ago
- Status changed from Feedback to In Progress
ok, I see. So the message "isotovideo: unable to inform websocket clients about stopping command server: Connection refused at /usr/bin/isotovideo line 175." is the one that you are concerned about, same as you wrote in the original description. So far I was sure that the message is only confusing but a followup of the actual error, not the cause. But then at least we can try to improve that. I added some related tickets around the same problem and will see what I can do to find out if that is the real problem, improve the error message and get the problems in these MM tests fixed.
Updated by mkittler over 4 years ago
I can confirm what @okurz said:
The message "isotovideo: unable to inform websocket clients…" is definitely not the cause of any problems besides avoidable error messages within the developer mode. This message would never cause a job to fail (be it a MM test or not).
However, a MM test which failed because a parallel job failed is indeed likely to contain the "isotovideo: unable to inform websocket clients…" message. That is because such a MM test is stopped by the worker which messes with the usual termination of isotovideo (to ensure there are no leftover processes). Hence isotovideo's web socket server might have already been terminated at this point. So if you want to know why your MM test failed this message is not a clue.
Not sure how to improve the error message to be less confusing. A simple solution would be to always also log additionally: "This only affects the developer mode and is unrelated to the test execution."
I've already tried that but only covered one place where the error is possibly logged and I suppose my phrasing wasn't too helpful as well.
Another solution would be to suppress the error message completely but I don't like that because it can make debugging extra-hard.
Updated by dzedro over 4 years ago
Nobody said anything about "isotovideo: unable to inform websocket clients…" being root cause.
The message is symptom which is there always in this MM failure...
Updated by okurz over 4 years ago
dzedro wrote:
Nobody said anything about "isotovideo: unable to inform websocket clients…" being root cause.
The message is symptom which is there always in this MM failure...
but this is what the subject line of your ticket says: "multimachine test fails due to websocket refusing connection" …
@mkittler You created https://github.com/os-autoinst/os-autoinst/pull/1500 . We should also keep in mind our history for related work, just to not run into regressions or back-and-forth: #45191 and #57707
What I found as reproducer for a "Connection refused" message started from os-autoinst/t/data/
:
timeout -s INT -v 4 ../../isotovideo -d casedir=/home/okurz/local/os-autoinst/os-autoinst/t/data/tests
yielding
[2020-08-05T20:20:28.158 CEST] [debug] <<< testapi::assert_screen(mustmatch="core", timeout=15, no_wait=1)
[2020-08-05T20:20:28.175 CEST] [debug] no match: 14.9s, best candidate: boot-core (0.00)
[2020-08-05T20:20:28.276 CEST] [debug] no change: 14.8s
…
[2020-08-05T20:20:28.819 CEST] [debug] no change: 14.3s
timeout: sending signal INT to command ‘../../isotovideo’
[2020-08-05T20:20:28.839 CEST] [debug] isotovideo received signal INT
[2020-08-05T20:20:28.840 CEST] [debug] isotovideo received signal INT
[2020-08-05T20:20:28.840 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 2 from pid 27273 (timeout)
[2020-08-05T20:20:28.841 CEST] [debug] no change: 14.3s
[2020-08-05T20:20:28.842 CEST] [debug] terminating command server 27278 because test execution ended
[2020-08-05T20:20:28.842 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:15223/PYkrRjvtBG/broadcast
[2020-08-05T20:20:28.848 CEST] [debug] commands process exited: 0
[2020-08-05T20:20:28.851 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Connection refused at ../../isotovideo line 175.
[2020-08-05T20:20:28.851 CEST] [debug] [autotest] process exited: 0
[2020-08-05T20:20:28.852 CEST] [debug] done with command server
[2020-08-05T20:20:28.852 CEST] [debug] stopping autotest process 27281
[2020-08-05T20:20:28.852 CEST] [debug] done with autotest process
[2020-08-05T20:20:28.852 CEST] [debug] isotovideo failed
[2020-08-05T20:20:28.853 CEST] [debug] stopping backend process 27298
[2020-08-05T20:20:28.854 CEST] [debug] backend got TERM
[2020-08-05T20:20:28.854 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2020-08-05T20:20:30.866 CEST] [debug] flushing frames
[2020-08-05T20:20:30.870 CEST] [debug] sending magic and exit
[2020-08-05T20:20:31.059 CEST] [debug] done with backend process
27274: EXIT 1
Created https://github.com/os-autoinst/os-autoinst/pull/1501 with my ideas.
Updated by dzedro over 4 years ago
okurz wrote:
dzedro wrote:
Nobody said anything about "isotovideo: unable to inform websocket clients…" being root cause.
The message is symptom which is there always in this MM failure...but this is what the subject line of your ticket says: "multimachine test fails due to websocket refusing connection" …
@mkittler You created https://github.com/os-autoinst/os-autoinst/pull/1500 . We should also keep in mind our history for related work, just to not run into regressions or back-and-forth: #45191 and #57707
Yes, because all this MM failures have it in log, if there is root cause which is causing this or another description which is describing the issue more accurate then great.
Updated by okurz over 4 years ago
- Related to action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added
Updated by okurz over 4 years ago
- Related to action #69694: openqa-worker systemd services running in osd which should not be enabled at all and have no tap-device configured auto_review:"backend died:.*tap.*is not connected to bridge.*br1":retry added
Updated by okurz over 4 years ago
- Subject changed from multimachine test fails due to websocket refusing connection to [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons
- Status changed from Feedback to In Progress
With #66376 done the feedback on incorrectly configured worker instances should be better. I have created the specific ticket #69694 so that auto-review can catch the issue and trigger automatic retries. We can try to solve the specific issue in #69694. Also with https://github.com/os-autoinst/os-autoinst/pull/1501 merged and deployed there are no more "Connection refused" messages expected. Based on how dzedro uses this ticket as more generic label I consider this an "epic" that might not even be solvable for good. We can still look in the database where the label is used. In some cases it is clearly wrong, e.g. https://openqa.suse.de/tests/4536070#step/iscsi_server/63 that now shows a needle mismatch. I created a corresponding new needle and retriggered as https://openqa.suse.de/tests/4536556.
EDIT: The last passed. Querying the database again shows currently:
4535778 | 2020-08-07 01:54:50 | done | failed | mru-install-multipath-remote | | openqaworker6
4535890 | 2020-08-07 03:05:50 | done | failed | qam-rsync-client | | openqaworker10
4537018 | 2020-08-07 13:55:58 | done | failed | mru-install-multipath-remote | | openqaworker8
4535890 failed in https://openqa.suse.de/tests/4535890#step/rsync_client/10 with # Test died: mydie at /usr/lib/os-autoinst/lockapi.pm line 41.
which could certainly be more helpful -> created feature request #69715 which we should not be concerned with right now as this would be only UX improvement anyway.
Comparing logs from server and client in parallel with vim -O <(curl -s https://openqa.suse.de/tests/4535818/file/autoinst-log.txt) <(curl -s https://openqa.suse.de/tests/4535890/file/autoinst-log.txt)
shows the following flow:
client: [2020-08-07T02:53:11.526 UTC] [debug] barrier wait 'rsync_setup'
…
client: [2020-08-07T02:54:22.190 UTC] [debug] <<< bmwqemu::mydie(cause_of_death="barrier 'rsync_setup': lock owner already finished")
…
server: [2020-08-07T05:00:54.528 CEST] [debug] barrier wait 'rsync_setup'
…
server: [2020-08-07T05:05:50.640 CEST] [debug] autotest received signal TERM, saving results of current test before exiting
note that the timezone differs on the two worker machines that have been used, UTC on openqaworker10 running the client and CEST on openqaworker8 running the server. This had been first reported in #62309#note-3 but not further followed on -> created #69718 for timezone harmonisation
What can be seen is that the client started the barrier wait but only waited until 02:54:22Z but the server only started the wait at 03:00:54Z, that is 6 mins later. The client job was still running at this time and only ended 03:05Z which also triggered the termination of the server job but this was merely the post_fail_hook executing, the waiting for synchronisation already failed way before. This is expected behaviour from side of openQA and really a problem in the design of the test -> #69787
Updated by okurz over 4 years ago
- Due date set to 2020-09-01
- Status changed from In Progress to Feedback
Have reported multiple specific issues to follow up with and also relabeled some tests and introduced auto-review tags for easier handling. After my vacation I can check for left cases.
Updated by dzedro about 4 years ago
I added workaround on QAM MM jobs, made MM jobs run on same worker with WORKER_CLASS=openqaworker. Tried to split it between all MM workers to not load only on and possibly delay the runs. Now there are no or much less failures.
Updated by okurz about 4 years ago
- Due date changed from 2020-09-01 to 2020-09-30
The SQL query select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-09-01' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id;
returned no jobs marked with this ticket anymore.
@dzedro would it be ok if you can remove the specific worker pinning again and we can check if there are any additional issues and then decide what to do?
Updated by dzedro about 4 years ago
I removed the workaround from few MM tests.
Updated by dzedro about 4 years ago
Removed workaround on all tests, could see failures related to this issue.
Updated by okurz about 4 years ago
With the SQL query select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-09-14' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id;
I found no jobs labeled with this ticket since 2020-09-14, have you found any?
Updated by okurz about 4 years ago
- Status changed from Feedback to Resolved
Seems we are ok here and we call the generic issue "Resolved". We still have open related issues that we identified.
Updated by szarate about 4 years ago
- Tracker changed from action to coordination
Updated by szarate about 4 years ago
See for the reason of tracker change: http://mailman.suse.de/mailman/private/qa-sle/2020-October/002722.html
Updated by okurz almost 4 years ago
- Related to action #80570: [qe-core][qem][sporadic] test fails in smt_client1 auto_review:"mydie.*acquiring barrier 'smt_setup': lock owner already finished":retry added