https://progress.opensuse.org/https://progress.opensuse.org/themes/openSUSE/favicon/favicon.ico?15829177842022-12-09T13:49:27ZopenSUSE Project Management ToolopenQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5832042022-12-09T13:49:27Zokurzokurz@suse.com
<ul><li><strong>Target version</strong> set to <i>future</i></li></ul><p>that looks like it's <em>not</em> crashing the worker instance even if maybe it should, right?</p>
<p>Please add more information at best following the openQA bug reporting template that you get when you click the "report bug" icon from openQA. Most important are "link to latest" and "steps to reproduce"</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5832342022-12-09T14:06:14ZMDouchamartin.doucha@suse.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/583234/diff?detail_id=548710">diff</a>)</li></ul> openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5832552022-12-09T14:12:32ZMDouchamartin.doucha@suse.com
<ul></ul><p>okurz wrote:</p>
<blockquote>
<p>that looks like it's <em>not</em> crashing the worker instance even if maybe it should, right?</p>
</blockquote>
<p><code>wait_serial()</code> clearly does not time out when it's supposed to.</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5832672022-12-09T14:19:12ZMDouchamartin.doucha@suse.com
<ul></ul><p>Or more precisely, the <code>wait_serial()</code> does not return when it should. The last <code>wait_serial()</code> that gets executed is waiting for <code>#</code> prompt and I can see the prompt in serial_terminal.txt. Yet the <code>wait_serial()</code> call ignores it and stays stuck indefinitely.</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5859192022-12-19T09:55:08ZMDouchamartin.doucha@suse.com
<ul></ul><p>Here's a new example on SLE-12SP4 so it's not just 15SP1 anymore:<br>
<a href="https://openqa.suse.de/tests/10189630#step/cgroup_fj_stress_blkio_4_4_each/8" class="external">https://openqa.suse.de/tests/10189630#step/cgroup_fj_stress_blkio_4_4_each/8</a></p>
<p>Once again, the job times out in <code>wait_serial()</code> waiting for shell prompt after the main test process finished. And the shell prompt is right there at the very end of serial_terminal.txt:<br>
<a href="https://openqa.suse.de/tests/10189630/logfile?filename=serial_terminal.txt" class="external">https://openqa.suse.de/tests/10189630/logfile?filename=serial_terminal.txt</a></p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5859552022-12-19T10:21:45ZMDouchamartin.doucha@suse.com
<ul></ul><p>And another case on SLE-15SP3:<br>
<a href="https://openqa.suse.de/tests/10192511#step/cgroup_fj_stress_blkio_4_4_each/7" class="external">https://openqa.suse.de/tests/10192511#step/cgroup_fj_stress_blkio_4_4_each/7</a><br>
<a href="https://openqa.suse.de/tests/10192511/logfile?filename=serial_terminal.txt" class="external">https://openqa.suse.de/tests/10192511/logfile?filename=serial_terminal.txt</a></p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5860002022-12-19T11:59:51Zrpalethorperichard.palethorpe@suse.com
<ul></ul><p>looking at a successful run of vmstat</p>
<pre><code>[2022-12-16T14:39:06.501999+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, output="", timeout=30)
[2022-12-16T14:39:06.502157+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:06.502380+01:00] [debug] <<< testapi::wait_serial(record_output=undef, quiet=undef, buffer_size=undef, regexp="# ", timeout=90, no_regex=1, expect_not_found=0)
[2022-12-16T14:39:06.503503+01:00] [debug] <<< consoles::serial_screen::read_until(pattern=[
"# "
], timeout=90, regexp="# ", record_output=undef, quiet=undef, expect_not_found=0, no_regex=1, json_cmd_token="NIYqbLbL", buffer_size=undef, cmd="backend_wait_serial")
</code></pre>
<p>read_until announces its entry. Now on an unsuccessful run:</p>
<pre><code>[2022-12-16T14:39:11.582074+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, timeout=30, output="")
[2022-12-16T14:39:11.582253+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:11.582491+01:00] [debug] <<< testapi::wait_serial(buffer_size=undef, regexp="# ", quiet=undef, record_output=undef, expect_not_found=0, no_regex=1, timeout=90)
</code></pre>
<p>here we just enter wait_serial on the front-end. It's not clear if the backend ever receives the command.</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5861022022-12-19T14:22:57Zrpalethorperichard.palethorpe@suse.com
<ul></ul><p>Created a PR for printing the PID: <a href="https://github.com/os-autoinst/os-autoinst/pull/2237" class="external">https://github.com/os-autoinst/os-autoinst/pull/2237</a><br>
myjsonrpc has a debug option which is set with an env variable: PERL_MYJSONRPC_DEBUG</p>
<p>It is extremely odd that it always happens on the same test, but the test is just moving some processes between CGroups. This is totally internal to the SUT except for the log output it generates, which is quite noisy.</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5985212023-02-06T13:32:11Zrpalethorperichard.palethorpe@suse.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul><p>Theoretically the JSON message could be never being read if the select in the backend's main loop never returns the FD for the cmd pipe. I don't see how that is possible in practice though. It would be nice to rule this out, so to that end I have created this PR <a href="https://github.com/os-autoinst/os-autoinst/pull/2259" class="external">https://github.com/os-autoinst/os-autoinst/pull/2259</a></p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=5987882023-02-07T08:49:52Zrpalethorperichard.palethorpe@suse.com
<ul></ul><p>Here is new output now that we have PID in log</p>
<pre><code>[2023-02-02T02:53:41.010973+01:00] [debug] [pid:5845] <<< testapi::script_run(cmd="vmstat -w", quiet=undef, timeout=30, die_on_timeout=1, output="")
[2023-02-02T02:53:41.011182+01:00] [debug] [pid:5845] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2023-02-02T02:53:41.011473+01:00] [debug] [pid:5845] <<< testapi::wait_serial(buffer_size=undef, quiet=undef, no_regex=1, expect_not_found=0, timeout=90, record_output=undef, regexp="# ")
</code></pre>
<p>Timeout happens here. Testapi sends wait_serial, but we never see read_until being entered.</p>
<p>Below QEMU says it is killed by PID 1047 which is the worker process. This is confirmed by the worker log.</p>
<pre><code>[2023-02-02T08:38:30.254667+01:00] [debug] [pid:5845] autotest received signal TERM, saving results of current test before exiting
[2023-02-02T08:38:30.256697+01:00] [debug] [pid:5644] isotovideo received signal TERM
[2023-02-02T08:38:30.259413+01:00] [debug] [pid:5911] backend got TERM
[2023-02-02T08:38:30.260873+01:00] [info] [pid:5911] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-02-02T08:38:30.268891+01:00] [debug] [pid:5644] commands process exited: 0
[2023-02-02T08:38:30.274473+01:00] [debug] [pid:5644] [autotest] process exited: 1
[2023-02-02T08:38:31.303807+01:00] [debug] [pid:5911] Passing remaining frames to the video encoder
[2023-02-02T08:38:31.304357+01:00] [debug] [pid:5911] Waiting for video encoder to finalize the video
[2023-02-02T08:38:31.304536+01:00] [debug] [pid:5911] The built-in video encoder (pid 5973) terminated
[2023-02-02T08:38:31.307130+01:00] [debug] [pid:5911] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 1047 (/usr/bin/perl)
[2023-02-02T08:38:31.313280+01:00] [debug] [pid:5911] sending magic and exit
[2023-02-02T08:38:31.314229+01:00] [debug] [pid:5644] received magic close
[2023-02-02T08:38:31.316686+01:00] [warn] [pid:5644] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 15 4 3
myjsonrpc: remote end terminated connection, stopping at /usr/lib/os-autoinst/myjsonrpc.pm line 43.
myjsonrpc::send_json(GLOB(0x55e175a60750), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 117
OpenQA::Isotovideo::CommandHandler::_send_to_cmd_srv(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 91
OpenQA::Isotovideo::CommandHandler::stop_command_processing(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 51
OpenQA::Isotovideo::Runner::run(OpenQA::Isotovideo::Runner=HASH(0x55e172bf6d88)) called at /usr/bin/isotovideo line 143
[2023-02-02T08:38:31.319126+01:00] [debug] [pid:5644] stopping backend process 5911
[2023-02-02T08:38:31.520546+01:00] [debug] [pid:5644] done with backend process
[2023-02-02T08:38:31.520846+01:00] [debug] [pid:5644] stopping autotest process 5845
[2023-02-02T08:38:31.521024+01:00] [debug] [pid:5644] done with autotest process
5644: EXIT 0
</code></pre>
<p>There is no timeout set on the do_capture loop and the stall detection is not active with the serial console. So there is nothing to rule out that it is spinning in that loop.</p>
<p>Something else which could cause this, is if the testapi front end sends two messages without waiting for a response. Then the backend could read both draining the socket buffer. However the second one will not be read because it will be sitting in the JSON parser's buffer. This is only checked when the fd has data. It's hard to rule this out.</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=6015562023-02-14T09:38:12Zrpalethorperichard.palethorpe@suse.com
<ul></ul><p>The os-autoinst PR has been merged. So setting DEBUG_JSON_RPC on the job will log all of the RPCs. It's probably best to do it on a cloned job first.</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=6762412023-09-21T10:16:55Zrpalethorperichard.palethorpe@suse.com
<ul><li><strong>Assignee</strong> set to <i>rpalethorpe</i></li></ul><p>This happened again (see link above).</p>
<p>After enabling DEBUG_JSON_RPC it passed. Which is not unexpected because the bug is random and slowing down the process with debug messages may prevent it from happening.</p>
<p>There are a number of things that could be done:</p>
<ol>
<li>Use kernel and uprobe tracing (assuming the later is practical with Perl)</li>
<li>Start rewriting the code</li>
<li>Replace the RPC code with a standard library</li>
</ol>
<p>The first one requires root access to a worker instance. The later two are likely to result in more issues.</p>
openQA Project - action #121774: LTP cgroup test appears to crash OpenQA worker instancehttps://progress.opensuse.org/issues/121774?journal_id=6840622023-10-10T10:25:55ZMDouchamartin.doucha@suse.com
<ul></ul><p>After enabling <code>DEBUG_JSON_RPC</code> on all ltp_controllers jobs, we got some slightly interesting output. Alas, it the true cause still remains a mystery.<br>
<a href="https://openqa.suse.de/tests/12430237" class="external">https://openqa.suse.de/tests/12430237</a><br>
<a href="https://openqa.suse.de/tests/12430237/logfile?filename=autoinst-log.txt" class="external">https://openqa.suse.de/tests/12430237/logfile?filename=autoinst-log.txt</a></p>
<pre><code>[2023-10-09T14:38:15.261431+02:00] [debug] [pid:114394] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, timeout=30, quiet=undef, output="")
[2023-10-09T14:38:15.261556+02:00] [debug] [pid:114394] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2023-10-09T14:38:15.261682+02:00] [debug] [pid:114394] <<< testapi::wait_serial(regexp="# ", record_output=undef, timeout=90, buffer_size=undef, no_regex=1, quiet=undef, expect_not_found=0)
[2023-10-09T14:38:15.261800+02:00] [debug] [pid:114394] send_json(5) JSON={"buffer_size":null,"cmd":"backend_wait_serial","expect_not_found":0,"json_cmd_token":"LlLeAPyt","no_regex":1,"quiet":null,"record_output":null,"regexp":"# ","timeout":90}
[2023-10-09T14:38:15.261865+02:00] [debug] [pid:114394] read_json(5)
[2023-10-09T20:27:51.091266+02:00] [debug] [pid:114394] autotest received signal TERM, saving results of current test before exiting
</code></pre>