LTP cgroup test appears to crash OpenQA worker instance
LTP test cgroup_fj_stress_blkio_4_4_each on the latest SLE-15SP1 KOTD kernel appears to crash the openQA worker instance it's running on. The test itself succeeds, but the openQA job stays stuck in wait_serial() for several hours (despite the 90 second timeout) until the whole job fails on MAX_JOB_TIME. There are 3 examples so far:
I've seen this issue only on SLE-15SP1 KOTD builds 156 and 157. I have not seen any cases on other SLE versions.
Typical autoinst-log.txt entries related to the timeout:
[2022-12-06T08:52:27.432374+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", output="", quiet=undef, timeout=30, die_on_timeout=1)
[2022-12-06T08:52:27.432549+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-06T08:52:27.432710+01:00] [debug] <<< testapi::wait_serial(record_output=undef, regexp="# ", quiet=undef, no_regex=1, buffer_size=undef, expect_not_found=0, timeout=90)
[2022-12-06T10:39:58.278597+01:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-12-06T10:39:58.278622+01:00] [debug] isotovideo received signal TERM
[2022-12-06T10:39:58.278748+01:00] [debug] backend got TERM
Last good: 4.12.14-150100.156.1.gb6c27ee (or more recent)
Always latest result in this scenario: latest
Steps to reproduce:
Run the ltp_controllers test suite on SLE-15SP1 KOTD
#1 Updated by okurz 4 months ago
- Target version set to future
That looks like it's not crashing the worker instance, even if maybe it should, right?
Please add more information, ideally following the openQA bug reporting template that you get when you click the "report bug" icon in openQA. Most important are "link to latest" and "steps to reproduce".
#5 Updated by MDoucha 3 months ago
Here's a new example on SLE-12SP4 so it's not just 15SP1 anymore:
Once again, the job times out in wait_serial() waiting for the shell prompt after the main test process finished. And the shell prompt is right there at the very end of serial_terminal.txt:
#6 Updated by MDoucha 3 months ago
#7 Updated by rpalethorpe 3 months ago
Looking at a successful run of vmstat:
[2022-12-16T14:39:06.501999+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, output="", timeout=30)
[2022-12-16T14:39:06.502157+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:06.502380+01:00] [debug] <<< testapi::wait_serial(record_output=undef, quiet=undef, buffer_size=undef, regexp="# ", timeout=90, no_regex=1, expect_not_found=0)
[2022-12-16T14:39:06.503503+01:00] [debug] <<< consoles::serial_screen::read_until(pattern=[ "# " ], timeout=90, regexp="# ", record_output=undef, quiet=undef, expect_not_found=0, no_regex=1, json_cmd_token="NIYqbLbL", buffer_size=undef, cmd="backend_wait_serial")
read_until announces its entry. Now on an unsuccessful run:
[2022-12-16T14:39:11.582074+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, timeout=30, output="")
[2022-12-16T14:39:11.582253+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:11.582491+01:00] [debug] <<< testapi::wait_serial(buffer_size=undef, regexp="# ", quiet=undef, record_output=undef, expect_not_found=0, no_regex=1, timeout=90)
Here we just enter wait_serial() on the front end. It's not clear whether the backend ever receives the command.
#8 Updated by rpalethorpe 3 months ago
Created a PR for printing the PID: https://github.com/os-autoinst/os-autoinst/pull/2237
myjsonrpc has a debug option which is set with an env variable: PERL_MYJSONRPC_DEBUG
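For illustration, this is how the variable could be enabled for a local run. Only the variable name comes from the comment above; the commented-out isotovideo invocation is an assumption about a typical local setup.

```shell
# Enable myjsonrpc debug tracing (variable name taken from the comment above).
export PERL_MYJSONRPC_DEBUG=1
echo "PERL_MYJSONRPC_DEBUG=$PERL_MYJSONRPC_DEBUG"
# Then run isotovideo as usual, e.g. (the path is a placeholder):
# isotovideo casedir=/path/to/os-autoinst-distri-opensuse
```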
It is extremely odd that it always happens on the same test, but the test is just moving some processes between cgroups. This is totally internal to the SUT except for the log output it generates, which is quite noisy.
#9 Updated by rpalethorpe about 2 months ago
- Status changed from New to In Progress
Theoretically the JSON message could never be read if the select() in the backend's main loop never returns the FD for the cmd pipe. I don't see how that is possible in practice, though. It would be nice to rule this out, so to that end I have created this PR: https://github.com/os-autoinst/os-autoinst/pull/2259
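The theory can be illustrated with a minimal Python sketch (an illustration only, not the actual Perl backend code): select() on a pipe that already has buffered data reliably reports it as readable, which is why starvation of the cmd pipe FD would be surprising in practice.

```python
# Sketch (assumption: stand-in for the backend main loop, not the real code).
import os
import select

cmd_r, cmd_w = os.pipe()          # stands in for the command pipe
os.write(cmd_w, b'{"cmd":"backend_wait_serial"}')

# select() reports a pipe with buffered data as readable, so for the
# message to never be read, the FD would have to be absent from this list.
readable, _, _ = select.select([cmd_r], [], [], 1.0)
print(cmd_r in readable)          # prints: True
```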
#10 Updated by rpalethorpe about 2 months ago
Here is new output now that we have PID in log
[2023-02-02T02:53:41.010973+01:00] [debug] [pid:5845] <<< testapi::script_run(cmd="vmstat -w", quiet=undef, timeout=30, die_on_timeout=1, output="")
[2023-02-02T02:53:41.011182+01:00] [debug] [pid:5845] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2023-02-02T02:53:41.011473+01:00] [debug] [pid:5845] <<< testapi::wait_serial(buffer_size=undef, quiet=undef, no_regex=1, expect_not_found=0, timeout=90, record_output=undef, regexp="# ")
The timeout happens here: testapi sends wait_serial, but we never see read_until being entered.
Below QEMU says it is killed by PID 1047 which is the worker process. This is confirmed by the worker log.
[2023-02-02T08:38:30.254667+01:00] [debug] [pid:5845] autotest received signal TERM, saving results of current test before exiting
[2023-02-02T08:38:30.256697+01:00] [debug] [pid:5644] isotovideo received signal TERM
[2023-02-02T08:38:30.259413+01:00] [debug] [pid:5911] backend got TERM
[2023-02-02T08:38:30.260873+01:00] [info] [pid:5911] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-02-02T08:38:30.268891+01:00] [debug] [pid:5644] commands process exited: 0
[2023-02-02T08:38:30.274473+01:00] [debug] [pid:5644] [autotest] process exited: 1
[2023-02-02T08:38:31.303807+01:00] [debug] [pid:5911] Passing remaining frames to the video encoder
[2023-02-02T08:38:31.304357+01:00] [debug] [pid:5911] Waiting for video encoder to finalize the video
[2023-02-02T08:38:31.304536+01:00] [debug] [pid:5911] The built-in video encoder (pid 5973) terminated
[2023-02-02T08:38:31.307130+01:00] [debug] [pid:5911] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 1047 (/usr/bin/perl)
[2023-02-02T08:38:31.313280+01:00] [debug] [pid:5911] sending magic and exit
[2023-02-02T08:38:31.314229+01:00] [debug] [pid:5644] received magic close
[2023-02-02T08:38:31.316686+01:00] [warn] [pid:5644] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 15 4 3
myjsonrpc: remote end terminated connection, stopping at /usr/lib/os-autoinst/myjsonrpc.pm line 43.
    myjsonrpc::send_json(GLOB(0x55e175a60750), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 117
    OpenQA::Isotovideo::CommandHandler::_send_to_cmd_srv(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 91
    OpenQA::Isotovideo::CommandHandler::stop_command_processing(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 51
    OpenQA::Isotovideo::Runner::run(OpenQA::Isotovideo::Runner=HASH(0x55e172bf6d88)) called at /usr/bin/isotovideo line 143
[2023-02-02T08:38:31.319126+01:00] [debug] [pid:5644] stopping backend process 5911
[2023-02-02T08:38:31.520546+01:00] [debug] [pid:5644] done with backend process
[2023-02-02T08:38:31.520846+01:00] [debug] [pid:5644] stopping autotest process 5845
[2023-02-02T08:38:31.521024+01:00] [debug] [pid:5644] done with autotest process 5644: EXIT 0
There is no timeout set on the do_capture loop, and stall detection is not active with the serial console. So nothing rules out that it is spinning in that loop.
Something else which could cause this is if the testapi front end sends two messages without waiting for a response. Then the backend could read both, draining the socket buffer. However, the second message would not be processed because it would be sitting in the JSON parser's buffer, which is only checked when the FD has data. It's hard to rule this out.
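This second scenario is a classic pitfall with buffered parsers on top of select(). A minimal Python sketch of the mechanism (illustrative only; the newline framing and message names are assumptions, not the actual myjsonrpc protocol):

```python
# Sketch of the "message stranded in the parser buffer" scenario.
import json
import select
import socket

a, b = socket.socketpair()
# Front end (a) sends two messages back to back without waiting:
a.sendall(b'{"cmd":"one"}\n{"cmd":"two"}\n')

# Backend (b) does one read, draining BOTH messages from the socket:
buf = b.recv(4096)
lines = buf.split(b"\n")
first = json.loads(lines[0])      # only the first message is handled
pending = lines[1]                # the second stays in the app-level buffer

# The kernel socket buffer is now empty, so select() reports nothing to
# read, even though a complete message is still waiting in `pending`:
readable, _, _ = select.select([b], [], [], 0.2)
print(first["cmd"], pending, readable == [])
# prints: one b'{"cmd":"two"}' True
```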
#11 Updated by rpalethorpe about 1 month ago
The os-autoinst PR has been merged, so setting DEBUG_JSON_RPC on the job will log all of the RPCs. It's probably best to try it on a cloned job first.
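A sketch of how this could be tried on a cloned job with openqa-clone-job; the instance URL and job ID below are placeholders, and passing DEBUG_JSON_RPC=1 as a settings override is an assumption based on the comment above.

```shell
# Clone the affected job and set DEBUG_JSON_RPC on the clone only.
# The URL and job ID are placeholders; adjust for the actual instance/job.
openqa-clone-job --within-instance https://openqa.suse.de 1234567 DEBUG_JSON_RPC=1
```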