coordination #65118

closed

[epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons

Added by dzedro over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2020-04-01
Due date: 2020-09-30
% Done: 0%
Estimated time:

Description

Observation

This issue on multimachine tests is happening every day. I briefly checked openvswitch and the tap devices and I think everything is fine there. But there is a websocket error, see https://openqa.suse.de/tests/4072150/file/autoinst-log.txt:
[2020-04-01T06:24:30.424 UTC] [debug] terminating command server 4670 because test execution ended
[2020-04-01T06:24:30.424 UTC] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20083/S0zCHTSjnanXpNnc/broadcast
[2020-04-01T06:24:30.429 UTC] [debug] isotovideo: unable to inform websocket clients about stopping command server: Connection refused at /usr/bin/isotovideo line 175.

openQA test in scenario sle-15-SP1-Server-DVD-Updates-x86_64-qam-rsync-client@64bit fails in
rsync_client

Test suite description

Maintainer: klorenzova

Reproducible

Fails since (at least) Build 20200401-1

Expected result

Last good: 20200331-2 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues: 8 (0 open, 8 closed)

Related to openQA Tests (public) - action #65366: [qe-core][qam] [sporadic] test fails with `mydie at /usr/lib/os-autoinst/lockapi.pm line 41.` (Resolved, 2020-04-07)

Related to openQA Project (public) - action #50225: Make JOB_TIMEOUT incompletes more obvious (Resolved, mkittler, 2019-04-10)

Related to openQA Project (public) - action #52997: [sle][functional][tools]test fails in sshd - timeout_exceeded (13:20 hours) (Rejected, okurz, 2019-06-13)

Related to openQA Project (public) - action #58379: isotovideo is slow to shutdown / error messages on proper shutdown (Resolved, okurz, 2019-10-04 to 2020-04-14)

Related to openQA Project (public) - action #57707: isotovideo fails to terminate cleanly, message "isotovideo: unable to inform websocket clients about stopping command server: Request timeout", regression from 4cd4af2b (Resolved, mkittler, 2019-10-04)

Related to openQA Infrastructure (public) - action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures (Resolved, mkittler, 2020-02-26)

Related to openQA Infrastructure (public) - action #69694: openqa-worker systemd services running in osd which should not be enabled at all and have no tap-device configured auto_review:"backend died:.*tap.*is not connected to bridge.*br1":retry (Resolved, okurz, 2020-08-07 to 2020-09-01)

Related to openQA Tests (public) - action #80570: [qe-core][qem][sporadic] test fails in smt_client1 auto_review:"mydie.*acquiring barrier 'smt_setup': lock owner already finished":retry (Rejected, zluo)
Actions #1

Updated by openqa_review over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam-vsftpd-client
https://openqa.suse.de/tests/4118883

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #2

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: qam-salt-minion
https://openqa.suse.de/tests/4196266

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #3

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: mru-install-multipath-remote
https://openqa.suse.de/tests/4256295

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #4

Updated by apappas over 4 years ago

  • Related to action #65366: [qe-core][qam] [sporadic] test fails with `mydie at /usr/lib/os-autoinst/lockapi.pm line 41.` added
Actions #5

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: mru-install-multipath-remote
https://openqa.suse.de/tests/4338220

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #6

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: mru-install-multipath-remote
https://openqa.suse.de/tests/4388463

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #7

Updated by riafarov over 4 years ago

  • Project changed from openQA Tests (public) to openQA Project (public)
  • Subject changed from [tools] multimachine test fails due to websocket refusing connection to multimachine test fails due to websocket refusing connection
  • Category deleted (Bugs in existing tests)

Setting correct project

Actions #8

Updated by okurz over 4 years ago

  • Category set to Regressions/Crashes
  • Status changed from New to Rejected
  • Assignee set to okurz
  • Target version set to Ready

The mentioned job https://openqa.suse.de/tests/4388463 yields a 404. https://openqa.suse.de/tests/4499394 is the latest job in the original scenario and it passed, with a long history of passed jobs in https://openqa.suse.de/tests/4499394#next_previous . This is unfeasible to fix if no reproduced issues can be found.

Actions #9

Updated by dzedro over 4 years ago

  • Status changed from Rejected to New

Don't reject an issue which is happening every day! I can post tens of failed MM tests every day.

https://openqa.suse.de/tests/4499216
https://openqa.suse.de/tests/4499450

Actions #10

Updated by dzedro over 4 years ago

  • Priority changed from Normal to Urgent
Actions #11

Updated by okurz over 4 years ago

  • Status changed from New to In Progress
  • Priority changed from Urgent to High

Please act like a professional. I stated that I could not find reproduced issues. And if this issue has been persisting for 4 months you have obviously found a good workaround, so I doubt the issue suddenly becomes "Urgent" just because you don't like my ticket update.

The jobs you linked are "parallel_failed" so they do not reflect the source problem. I can see that in both cases the jobs are "failed" and were running on openqaworker3:14, e.g. for https://openqa.suse.de/tests/4499279 and https://openqa.suse.de/tests/4499451 . In both cases there is an error in the log like "Failed to run dbus command 'set_vlan' with arguments 'tap13 18' : 'tap13' is not connected to bridge 'br1'". The reason is that worker instance 14 should not even be running according to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/openqa/workerconf.sls#L270 . The error handling should be improved by https://github.com/os-autoinst/os-autoinst/pull/1477 for #66376 . The systemd service on the worker should not even be enabled, but it is:

okurz@openqaworker3:~> sudo systemctl status openqa-worker@14
● openqa-worker@14.service - openQA Worker #14
   Loaded: loaded (/usr/lib/systemd/system/openqa-worker@.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/openqa-worker@.service.d
           └─override.conf
   Active: active (running) since Wed 2020-07-29 07:27:26 CEST; 5h 42min ago

Why it was started 5 h ago I am not sure, but it's the same problem as in #66376 . I now ran sudo systemctl disable --now openqa-worker@{14..16}

Actions #12

Updated by dzedro over 4 years ago

It became urgent because it has been an everyday issue since it was created, and you literally asked for it with your professionalism.
I don't think it's related to one worker, I have seen it on "all" workers.

https://openqa.suse.de/tests/4499071
https://openqa.suse.de/tests/4499158

Actions #13

Updated by okurz over 4 years ago

At least I think we can crosscheck for worker instances that are outside the correctly configured range, e.g. on each worker:

numworkers=$(sed -n "s/# numofworkers: //p" /etc/openqa/workers.ini)
for i in $(sudo systemctl --no-legend list-units -t service --state=active openqa-worker@\* | sed -n "s/^.*@\([0-9]\+\).*$/\1/p"); do test $i -gt $numworkers && echo "openqa-worker@$i active but higher than numworkers: $numworkers"; done

Executing that within a salt cmd.run is tricky and will not be parsed correctly, so I put it into https://w3.suse.de/~okurz/check_num_openqa_workers and call it with sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'curl -s https://w3.suse.de/~okurz/check_num_openqa_workers | sh -', which confirms that there are no more superfluous worker instances after stopping the ones on openqaworker3.
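The one-liner above can be factored into a small standalone function that is easier to test in isolation. This is only a sketch: the `numofworkers` value and the instance numbering follow the commands above, while the function name and the idea of passing the parsed instance numbers as arguments are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of the check above as a pure function: the first argument is the
# configured numofworkers, the remaining arguments are the active instance
# numbers (as parsed from `systemctl list-units openqa-worker@\*`).
check_num_openqa_workers() {
    numworkers=$1
    shift
    for i in "$@"; do
        if [ "$i" -gt "$numworkers" ]; then
            echo "openqa-worker@$i active but higher than numworkers: $numworkers"
        fi
    done
}

# Example: 13 configured worker instances, instances 1, 13 and 14 active;
# only instance 14 is reported as superfluous
check_num_openqa_workers 13 1 13 14
```

Keeping the systemd parsing separate from the comparison makes the logic trivially checkable without touching a live worker.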

For the mentioned jobs from #65118#note-12, both jobs are "parallel_failed" so we have to look for the error in the parallel jobs. For the first, the parallel one is https://openqa.suse.de/tests/4499072 which ran on openqaworker3:14, i.e. the misconfigured instance, which shouldn't happen anymore until something/someone restarts the wrong worker instances. Also the autoinst-log.txt in https://openqa.suse.de/tests/4499072/file/autoinst-log.txt shows the expected error message "Failed to run dbus command 'set_vlan' with arguments 'tap13 17' : 'tap13' is not connected to bridge 'br1'".

The other job stopped in https://openqa.suse.de/tests/4499165#step/smt_client1/10 with "# Test died: mydie at /usr/lib/os-autoinst/lockapi.pm line 41." and no dbus error. Details from autoinst-log.txt:

[2020-07-29T06:07:29.616 CEST] [debug] <<< testapi::record_info(title="Paused", output="Wait for smt_setup (on parent job)", result="ok")
[2020-07-29T06:07:29.691 CEST] [info] ::: lockapi::_try_lock: Retry 1 of 7...
[2020-07-29T06:07:39.758 CEST] [info] ::: lockapi::_try_lock: Retry 2 of 7...
[2020-07-29T06:07:49.828 CEST] [info] ::: lockapi::_try_lock: Retry 3 of 7...
[2020-07-29T06:07:59.916 CEST] [info] ::: lockapi::_try_lock: Retry 4 of 7...
[2020-07-29T06:08:10.001 CEST] [info] ::: lockapi::_try_lock: Retry 5 of 7...
[2020-07-29T06:08:20.118 CEST] [info] ::: lockapi::_try_lock: Retry 6 of 7...
[2020-07-29T06:08:30.207 CEST] [info] ::: lockapi::_try_lock: Retry 7 of 7...
[2020-07-29T06:08:40.208 CEST] [debug] tests/smt/smt_client1.pm:32 called lockapi::barrier_wait
[2020-07-29T06:08:40.208 CEST] [debug] <<< bmwqemu::mydie(cause_of_death="barrier 'smt_setup': lock owner already finished")
[2020-07-29T06:08:40.297 CEST] [info] ::: basetest::runtest: # Test died: mydie at /usr/lib/os-autoinst/lockapi.pm line 41.

That I don't understand yet, but it is certainly not the same problem as for the other job.

With the SQL query select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-07-20' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id; I look for all jobs with the ticket id in a comment. To exclude openqaworker3, which was ill-configured, I am using the SQL query:

select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-07-20' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id and host != 'openqaworker3';
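The two variants of the query differ only in the host filter, so they can be generated by a small helper. Only the SQL itself is taken from the ticket; the function name and the idea of piping the result into psql are assumptions for illustration.

```shell
#!/bin/sh
# Build the label-search SQL for a given ticket id and cutoff date,
# optionally excluding one worker host (e.g. a misconfigured one).
build_label_query() {
    ticket=$1
    since=$2
    exclude_host=$3
    q="select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '$since' and jobs.id = comments.job_id and comments.text ~ '$ticket' and workers.id = assigned_worker_id"
    if [ -n "$exclude_host" ]; then
        q="$q and host != '$exclude_host'"
    fi
    echo "$q;"
}

# e.g. build_label_query 65118 2020-07-20 openqaworker3 | psql openqa
build_label_query 65118 2020-07-20 openqaworker3
```

This keeps the join and the label regex in one place when re-running the check with different cutoff dates, as done in later comments.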

which shows me 40 jobs right now. The latest job failed in https://openqa.suse.de/tests/4508248#step/boot_to_desktop/5 in the scenario "sle-12-SP4-Server-DVD-Updates-x86_64-mru-install-multipath-remote@64bit" with details

[2020-07-30T21:28:41.047 CEST] [debug] <<< testapi::send_key(key="ret", do_wait=0, wait_screen_change=0)
[2020-07-30T21:28:41.315 CEST] [debug] tests/boot/boot_to_desktop.pm:42 called opensusebasetest::wait_boot -> lib/opensusebasetest.pm:1022 called opensusebasetest::wait_boot_past_bootloader -> lib/opensusebasetest.pm:920 called opensusebasetest::handle_displaymanager_login -> lib/opensusebasetest.pm:753 called testapi::assert_screen
[2020-07-30T21:28:41.316 CEST] [debug] <<< testapi::assert_screen(mustmatch=[
  "displaymanager",
  "emergency-shell",
  "emergency-mode"
], timeout=300)
…
[2020-07-30T21:33:49.497 CEST] [debug] >>> testapi::_check_backend_response: match=displaymanager,emergency-mode,emergency-shell timed out after 300 (assert_screen)
[2020-07-30T21:33:49.539 CEST] [info] ::: basetest::runtest: # Test died: no candidate needle with tag(s) 'displaymanager, emergency-shell, emergency-mode' matched

while it looks like the system is still stuck in the PXE boot menu, not acting on the "ret" that was pressed. I do not see any relation to the problem described before.

@dzedro can you explain why you think that for example https://openqa.suse.de/tests/4508248# also fails due to problems of the network between the SUT VMs?

Actions #14

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

@dzedro can you explain why you think that for example https://openqa.suse.de/tests/4508248# also fails due to problems of the network between the SUT VMs?

Actions #15

Updated by dzedro over 4 years ago

Because it's MM and the failure (websocket refusing connection) is always on the parent, often the supportserver, or just the test that another test is PARALLEL_WITH.

Actions #16

Updated by okurz over 4 years ago

  • Related to action #50225: Make JOB_TIMEOUT incompletes more obvious added
Actions #17

Updated by okurz over 4 years ago

  • Related to action #52997: [sle][functional][tools]test fails in sshd - timeout_exceeded (13:20 hours) added
Actions #18

Updated by okurz over 4 years ago

  • Related to action #58379: isotovideo is slow to shutdown / error messages on proper shutdown added
Actions #19

Updated by okurz over 4 years ago

  • Related to action #57707: isotovideo fails to terminate cleanly, message "isotovideo: unable to inform websocket clients about stopping command server: Request timeout", regression from 4cd4af2b added
Actions #20

Updated by okurz over 4 years ago

  • Status changed from Feedback to In Progress

ok, I see. So the message "isotovideo: unable to inform websocket clients about stopping command server: Connection refused at /usr/bin/isotovideo line 175." is the one that you are concerned about, the same as in the original description. So far I was sure that the message is only confusing, a follow-up of the actual error, not the cause. But then at least we can try to improve it. I added some related tickets around the same problem and will see what I can do to find out whether that is the real problem, improve the error message and get the problems in these MM tests fixed.

Actions #21

Updated by mkittler over 4 years ago

I can confirm what @okurz said:

The message "isotovideo: unable to inform websocket clients…" is definitely not the cause of any problems besides avoidable error messages within the developer mode. This message would never cause a job to fail (be it a MM test or not).

However, a MM test which failed because a parallel job failed is indeed likely to contain the "isotovideo: unable to inform websocket clients…" message. That is because such a MM test is stopped by the worker, which interferes with the usual termination of isotovideo (to ensure there are no leftover processes). Hence isotovideo's web socket server might already have been terminated at this point. So if you want to know why your MM test failed, this message is not a clue.

Not sure how to improve the error message to be less confusing. A simple solution would be to always additionally log: "This only affects the developer mode and is unrelated to the test execution."
I've already tried that but only covered one place where the error is possibly logged, and I suppose my phrasing wasn't too helpful either.

Another solution would be to suppress the error message completely, but I don't like that because it can make debugging extra hard.

Actions #22

Updated by dzedro over 4 years ago

Nobody said anything about "isotovideo: unable to inform websocket clients…" being the root cause.
The message is a symptom which is always present in this MM failure...

Actions #23

Updated by okurz over 4 years ago

dzedro wrote:

Nobody said anything about "isotovideo: unable to inform websocket clients…" being the root cause.
The message is a symptom which is always present in this MM failure...

but this is what the subject line of your ticket says: "multimachine test fails due to websocket refusing connection" …

@mkittler You created https://github.com/os-autoinst/os-autoinst/pull/1500 . We should also keep in mind our history for related work, just to not run into regressions or back-and-forth: #45191 and #57707

What I found as a reproducer for a "Connection refused" message, starting from os-autoinst/t/data/:

timeout -s INT -v 4 ../../isotovideo -d casedir=/home/okurz/local/os-autoinst/os-autoinst/t/data/tests

yielding

[2020-08-05T20:20:28.158 CEST] [debug] <<< testapi::assert_screen(mustmatch="core", timeout=15, no_wait=1)
[2020-08-05T20:20:28.175 CEST] [debug] no match: 14.9s, best candidate: boot-core (0.00)
[2020-08-05T20:20:28.276 CEST] [debug] no change: 14.8s
…
[2020-08-05T20:20:28.819 CEST] [debug] no change: 14.3s
timeout: sending signal INT to command ‘../../isotovideo’
[2020-08-05T20:20:28.839 CEST] [debug] isotovideo received signal INT
[2020-08-05T20:20:28.840 CEST] [debug] isotovideo received signal INT
[2020-08-05T20:20:28.840 CEST] [debug] QEMU: qemu-system-x86_64: terminating on signal 2 from pid 27273 (timeout)
[2020-08-05T20:20:28.841 CEST] [debug] no change: 14.3s
[2020-08-05T20:20:28.842 CEST] [debug] terminating command server 27278 because test execution ended
[2020-08-05T20:20:28.842 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:15223/PYkrRjvtBG/broadcast
[2020-08-05T20:20:28.848 CEST] [debug] commands process exited: 0
[2020-08-05T20:20:28.851 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Connection refused at ../../isotovideo line 175.

[2020-08-05T20:20:28.851 CEST] [debug] [autotest] process exited: 0
[2020-08-05T20:20:28.852 CEST] [debug] done with command server
[2020-08-05T20:20:28.852 CEST] [debug] stopping autotest process 27281
[2020-08-05T20:20:28.852 CEST] [debug] done with autotest process
[2020-08-05T20:20:28.852 CEST] [debug] isotovideo failed
[2020-08-05T20:20:28.853 CEST] [debug] stopping backend process 27298
[2020-08-05T20:20:28.854 CEST] [debug] backend got TERM
[2020-08-05T20:20:28.854 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2020-08-05T20:20:30.866 CEST] [debug] flushing frames
[2020-08-05T20:20:30.870 CEST] [debug] sending magic and exit
[2020-08-05T20:20:31.059 CEST] [debug] done with backend process
27274: EXIT 1

Created https://github.com/os-autoinst/os-autoinst/pull/1501 with my ideas.

Actions #24

Updated by dzedro over 4 years ago

okurz wrote:

dzedro wrote:

Nobody said anything about "isotovideo: unable to inform websocket clients…" being the root cause.
The message is a symptom which is always present in this MM failure...

but this is what the subject line of your ticket says: "multimachine test fails due to websocket refusing connection" …

@mkittler You created https://github.com/os-autoinst/os-autoinst/pull/1500 . We should also keep in mind our history for related work, just to not run into regressions or back-and-forth: #45191 and #57707

Yes, because all these MM failures have it in the log. If there is a root cause causing this, or another description describing the issue more accurately, then great.

Actions #25

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback
Actions #26

Updated by okurz over 4 years ago

The change has been deployed to osd already.

Actions #27

Updated by okurz over 4 years ago

  • Related to action #63874: ensure openqa worker instances are disabled and stopped when "numofworkers" is reduced in salt pillars, e.g. causing non-obvious multi-machine failures added
Actions #28

Updated by okurz over 4 years ago

  • Related to action #69694: openqa-worker systemd services running in osd which should not be enabled at all and have no tap-device configured auto_review:"backend died:.*tap.*is not connected to bridge.*br1":retry added
Actions #29

Updated by okurz over 4 years ago

  • Subject changed from multimachine test fails due to websocket refusing connection to [epic] multimachine test fails with symptoms "websocket refusing connection" and other unclear reasons
  • Status changed from Feedback to In Progress

With #66376 done, the feedback on incorrectly configured worker instances should be better. I have created the specific ticket #69694 so that auto-review can catch the issue and trigger automatic retries. We can try to solve that specific issue in #69694. Also, with https://github.com/os-autoinst/os-autoinst/pull/1501 merged and deployed, no more "Connection refused" messages are expected. Based on how dzedro uses this ticket as a more generic label I consider this an "epic" that might not even be solvable for good. We can still look in the database where the label is used. In some cases it is clearly wrong, e.g. https://openqa.suse.de/tests/4536070#step/iscsi_server/63 which now shows a needle mismatch. I created a corresponding new needle and retriggered as https://openqa.suse.de/tests/4536556.

EDIT: The last passed. Querying the database again shows currently:

 4535778 | 2020-08-07 01:54:50 | done  | failed | mru-install-multipath-remote |        | openqaworker6
 4535890 | 2020-08-07 03:05:50 | done  | failed | qam-rsync-client             |        | openqaworker10
 4537018 | 2020-08-07 13:55:58 | done  | failed | mru-install-multipath-remote |        | openqaworker8

4535890 failed in https://openqa.suse.de/tests/4535890#step/rsync_client/10 with "# Test died: mydie at /usr/lib/os-autoinst/lockapi.pm line 41.", which could certainly be more helpful -> created feature request #69715, which we should not be concerned with right now as it would only be a UX improvement anyway.

Comparing the logs from server and client in parallel with vim -O <(curl -s https://openqa.suse.de/tests/4535818/file/autoinst-log.txt) <(curl -s https://openqa.suse.de/tests/4535890/file/autoinst-log.txt) shows the following flow:

client: [2020-08-07T02:53:11.526 UTC] [debug] barrier wait 'rsync_setup'
…
client: [2020-08-07T02:54:22.190 UTC] [debug] <<< bmwqemu::mydie(cause_of_death="barrier 'rsync_setup': lock owner already finished")
…
server: [2020-08-07T05:00:54.528 CEST] [debug] barrier wait 'rsync_setup'
…
server: [2020-08-07T05:05:50.640 CEST] [debug] autotest received signal TERM, saving results of current test before exiting

Note that the timezone differs on the two worker machines that were used: UTC on openqaworker10 running the client and CEST on openqaworker8 running the server. This had first been reported in #62309#note-3 but was not followed up on -> created #69718 for timezone harmonisation.

What can be seen is that the client started the barrier wait but only waited until 02:54:22Z, while the server only started its wait at 03:00:54Z, i.e. 6 minutes later. The client job was still running at this time and only ended at 03:05Z, which also triggered the termination of the server job, but that was merely the post_fail_hook executing; the waiting for synchronisation had already failed way before. This is expected behaviour on the side of openQA and really a problem in the design of the test -> #69787
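The race can be illustrated with a toy script. This is a sketch, not the real lockapi: the barrier is just a file, the timings are shortened from 6 minutes to 1 second, and only the retry count and the mydie wording follow the logs above.

```shell
#!/bin/sh
# Toy reproduction of the race: the "client" polls for the barrier a fixed
# number of times and gives up, while the "server" only creates the barrier
# after the client has already died (1 s here instead of 6 minutes).
simulate_barrier_race() {
    barrier="${TMPDIR:-/tmp}/rsync_setup.$$"
    ( sleep 1; : > "$barrier" ) &   # server: arrives too late
    server_pid=$!
    for retry in 1 2 3 4 5 6 7; do  # client: 7 retries as in the logs
        if [ -e "$barrier" ]; then
            echo "barrier reached"
            break
        fi
        echo "lockapi::_try_lock: Retry $retry of 7..."
    done
    if [ ! -e "$barrier" ]; then
        echo "mydie: barrier 'rsync_setup': lock owner already finished"
    fi
    wait "$server_pid"
    rm -f "$barrier"
}

simulate_barrier_race
```

The point is that no amount of client-side retrying helps when the retry window closes before the peer arrives; the fix belongs in the test design (or a longer barrier timeout), as #69787 suggests.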

Actions #30

Updated by okurz over 4 years ago

  • Due date set to 2020-09-01
  • Status changed from In Progress to Feedback

I have reported multiple specific issues to follow up on, relabeled some tests and introduced auto-review tags for easier handling. After my vacation I can check for remaining cases.

Actions #31

Updated by dzedro over 4 years ago

I added a workaround on QAM MM jobs: I made MM jobs run on the same worker with WORKER_CLASS=openqaworker. I tried to split it between all MM workers so as not to load only one and possibly delay the runs. Now there are no or far fewer failures.

Actions #32

Updated by okurz about 4 years ago

  • Due date changed from 2020-09-01 to 2020-09-30

The SQL query select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-09-01' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id; returned no jobs marked with this ticket anymore.

@dzedro would it be ok if you remove the specific worker pinning again so we can check whether there are any additional issues and then decide what to do?

Actions #33

Updated by dzedro about 4 years ago

I removed the workaround from a few MM tests.

Actions #34

Updated by dzedro about 4 years ago

Removed the workaround on all tests, could see failures related to this issue.

Actions #35

Updated by okurz about 4 years ago

With the SQL query select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where t_finished >= '2020-09-14' and jobs.id = comments.job_id and comments.text ~ '65118' and workers.id = assigned_worker_id; I found no jobs labeled with this ticket since 2020-09-14. Have you found any?

Actions #36

Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved

Seems we are ok here, so we call the generic issue "Resolved". We still have the open related issues that we identified.

Actions #37

Updated by szarate about 4 years ago

  • Tracker changed from action to coordination
Actions #39

Updated by okurz almost 4 years ago

  • Related to action #80570: [qe-core][qem][sporadic] test fails in smt_client1 auto_review:"mydie.*acquiring barrier 'smt_setup': lock owner already finished":retry added