action #121774
LTP cgroup test appears to crash OpenQA worker instance
Description
LTP test cgroup_fj_stress_blkio_4_4_each on the latest SLE-15SP1 KOTD kernel appears to crash the OpenQA worker instance it's running on. The test itself succeeds, but the OpenQA job stays stuck in wait_serial() for several hours (despite the 90-second timeout) until the whole job fails on MAX_JOB_TIME. There are 3 examples so far:
https://openqa.suse.de/tests/10089424#step/cgroup_fj_stress_blkio_4_4_each/7
https://openqa.suse.de/tests/10111009#step/cgroup_fj_stress_blkio_4_4_each/7
https://openqa.suse.de/tests/10113099#step/cgroup_fj_stress_blkio_4_4_each/7
I've seen this issue only on SLE-15SP1 KOTD builds 156 and 157. I have not seen any cases on other SLE versions.
Typical autoinst-log.txt entries related to the timeout:
[2022-12-06T08:52:27.432374+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", output="", quiet=undef, timeout=30, die_on_timeout=1)
[2022-12-06T08:52:27.432549+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-06T08:52:27.432710+01:00] [debug] <<< testapi::wait_serial(record_output=undef, regexp="# ", quiet=undef, no_regex=1, buffer_size=undef, expect_not_found=0, timeout=90)
[2022-12-06T10:39:58.278597+01:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-12-06T10:39:58.278622+01:00] [debug] isotovideo received signal TERM
[2022-12-06T10:39:58.278748+01:00] [debug] backend got TERM
Expected result
Last good: 4.12.14-150100.156.1.gb6c27ee (or more recent)
Further details
Always latest result in this scenario: latest
Steps to reproduce:
- Run the ltp_controllers testsuite on SLE-15SP1 KOTD
- Wait.
Updated by okurz about 2 years ago
- Target version set to future
that looks like it's not crashing the worker instance even if maybe it should, right?
Please add more information, at best following the openQA bug reporting template that you get when you click the "report bug" icon in openQA. Most important are "link to latest" and "steps to reproduce".
Updated by MDoucha about 2 years ago
okurz wrote:
that looks like it's not crashing the worker instance even if maybe it should, right?
wait_serial() clearly does not time out when it's supposed to.
Updated by MDoucha about 2 years ago
Or more precisely, the wait_serial() does not return when it should. The last wait_serial() that gets executed is waiting for the "# " prompt, and I can see the prompt in serial_terminal.txt. Yet the wait_serial() call ignores it and stays stuck indefinitely.
Updated by MDoucha about 2 years ago
Here's a new example on SLE-12SP4, so it's not just 15SP1 anymore:
https://openqa.suse.de/tests/10189630#step/cgroup_fj_stress_blkio_4_4_each/8
Once again, the job times out in wait_serial() waiting for the shell prompt after the main test process finished. And the shell prompt is right there at the very end of serial_terminal.txt:
https://openqa.suse.de/tests/10189630/logfile?filename=serial_terminal.txt
Updated by MDoucha about 2 years ago
Updated by rpalethorpe about 2 years ago
Looking at a successful run of vmstat:
[2022-12-16T14:39:06.501999+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, output="", timeout=30)
[2022-12-16T14:39:06.502157+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:06.502380+01:00] [debug] <<< testapi::wait_serial(record_output=undef, quiet=undef, buffer_size=undef, regexp="# ", timeout=90, no_regex=1, expect_not_found=0)
[2022-12-16T14:39:06.503503+01:00] [debug] <<< consoles::serial_screen::read_until(pattern=[
"# "
], timeout=90, regexp="# ", record_output=undef, quiet=undef, expect_not_found=0, no_regex=1, json_cmd_token="NIYqbLbL", buffer_size=undef, cmd="backend_wait_serial")
read_until announces its entry. Now on an unsuccessful run:
[2022-12-16T14:39:11.582074+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, timeout=30, output="")
[2022-12-16T14:39:11.582253+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:11.582491+01:00] [debug] <<< testapi::wait_serial(buffer_size=undef, regexp="# ", quiet=undef, record_output=undef, expect_not_found=0, no_regex=1, timeout=90)
Here we just enter wait_serial on the front end. It's not clear if the backend ever receives the command.
Updated by rpalethorpe about 2 years ago
Created a PR for printing the PID: https://github.com/os-autoinst/os-autoinst/pull/2237
myjsonrpc has a debug option which is set with an env variable: PERL_MYJSONRPC_DEBUG
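A hedged sketch of turning this on for a local run (the variable name comes from the comment above; treating "1" as the enabling value is an assumption):

```shell
# Enable myjsonrpc debug logging before launching isotovideo locally.
# PERL_MYJSONRPC_DEBUG is the variable named above; the exact value it
# checks for is an assumption here.
export PERL_MYJSONRPC_DEBUG=1

# then start isotovideo as usual, e.g.:
#   isotovideo casedir=/path/to/os-autoinst-distri-opensuse
```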
It is extremely odd that it always happens in the same test, but the test is just moving some processes between cgroups. This is totally internal to the SUT except for the log output it generates, which is quite noisy.
Updated by rpalethorpe almost 2 years ago
- Status changed from New to In Progress
Theoretically, the JSON message could never be read if the select in the backend's main loop never returns the FD for the cmd pipe. I don't see how that is possible in practice, though. It would be nice to rule this out, so to that end I have created this PR: https://github.com/os-autoinst/os-autoinst/pull/2259
Updated by rpalethorpe almost 2 years ago
Here is new output now that we have the PID in the log:
[2023-02-02T02:53:41.010973+01:00] [debug] [pid:5845] <<< testapi::script_run(cmd="vmstat -w", quiet=undef, timeout=30, die_on_timeout=1, output="")
[2023-02-02T02:53:41.011182+01:00] [debug] [pid:5845] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2023-02-02T02:53:41.011473+01:00] [debug] [pid:5845] <<< testapi::wait_serial(buffer_size=undef, quiet=undef, no_regex=1, expect_not_found=0, timeout=90, record_output=undef, regexp="# ")
Timeout happens here. Testapi sends wait_serial, but we never see read_until being entered.
Below, QEMU says it was killed by PID 1047, which is the worker process. This is confirmed by the worker log.
[2023-02-02T08:38:30.254667+01:00] [debug] [pid:5845] autotest received signal TERM, saving results of current test before exiting
[2023-02-02T08:38:30.256697+01:00] [debug] [pid:5644] isotovideo received signal TERM
[2023-02-02T08:38:30.259413+01:00] [debug] [pid:5911] backend got TERM
[2023-02-02T08:38:30.260873+01:00] [info] [pid:5911] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-02-02T08:38:30.268891+01:00] [debug] [pid:5644] commands process exited: 0
[2023-02-02T08:38:30.274473+01:00] [debug] [pid:5644] [autotest] process exited: 1
[2023-02-02T08:38:31.303807+01:00] [debug] [pid:5911] Passing remaining frames to the video encoder
[2023-02-02T08:38:31.304357+01:00] [debug] [pid:5911] Waiting for video encoder to finalize the video
[2023-02-02T08:38:31.304536+01:00] [debug] [pid:5911] The built-in video encoder (pid 5973) terminated
[2023-02-02T08:38:31.307130+01:00] [debug] [pid:5911] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 1047 (/usr/bin/perl)
[2023-02-02T08:38:31.313280+01:00] [debug] [pid:5911] sending magic and exit
[2023-02-02T08:38:31.314229+01:00] [debug] [pid:5644] received magic close
[2023-02-02T08:38:31.316686+01:00] [warn] [pid:5644] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 15 4 3
myjsonrpc: remote end terminated connection, stopping at /usr/lib/os-autoinst/myjsonrpc.pm line 43.
myjsonrpc::send_json(GLOB(0x55e175a60750), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 117
OpenQA::Isotovideo::CommandHandler::_send_to_cmd_srv(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 91
OpenQA::Isotovideo::CommandHandler::stop_command_processing(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 51
OpenQA::Isotovideo::Runner::run(OpenQA::Isotovideo::Runner=HASH(0x55e172bf6d88)) called at /usr/bin/isotovideo line 143
[2023-02-02T08:38:31.319126+01:00] [debug] [pid:5644] stopping backend process 5911
[2023-02-02T08:38:31.520546+01:00] [debug] [pid:5644] done with backend process
[2023-02-02T08:38:31.520846+01:00] [debug] [pid:5644] stopping autotest process 5845
[2023-02-02T08:38:31.521024+01:00] [debug] [pid:5644] done with autotest process
5644: EXIT 0
There is no timeout set on the do_capture loop, and stall detection is not active with the serial console. So there is nothing to rule out that it is spinning in that loop.
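For illustration, a hedged sketch (Python, not os-autoinst's actual Perl code; `read_until` and its chunk source are simplified stand-ins) of what an overall deadline on such a capture loop could look like:

```python
import time

# Sketch of a read_until/do_capture-style loop that honours an overall
# deadline. `read_chunk` stands in for whatever drains the serial console.
def read_until(read_chunk, pattern: bytes, timeout: float) -> bytes:
    deadline = time.monotonic() + timeout
    buf = b""
    while pattern not in buf:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no {pattern!r} within {timeout}s")
        buf += read_chunk()
    return buf

chunks = iter([b"vmstat output\n", b"# "])
print(read_until(lambda: next(chunks, b""), b"# ", timeout=1.0))
# → b'vmstat output\n# '
```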
Something else which could cause this is the testapi front end sending two messages without waiting for a response. Then the backend could read both, draining the socket buffer. However, the second one would not be processed because it would be sitting in the JSON parser's buffer, which is only checked when the fd has data. It's hard to rule this out.
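That drained-buffer pitfall can be demonstrated with a minimal, self-contained sketch (plain Python, not os-autoinst's Perl code):

```python
import json
import os
import select

# The peer writes two JSON messages back to back. One select() wakeup
# drains the fd completely, but only the first message gets decoded; the
# second sits in the parser's buffer while the now-empty fd never becomes
# readable again, so a purely fd-driven loop would wait here forever.
r, w = os.pipe()
os.write(w, b'{"cmd": "backend_wait_serial"}\n{"cmd": "backend_stop"}\n')

buffer = b""
decoded = []

ready, _, _ = select.select([r], [], [], 0.1)
if ready:
    buffer += os.read(r, 4096)          # drains BOTH messages from the fd
    line, _, buffer = buffer.partition(b"\n")
    decoded.append(json.loads(line))    # ...but decodes only the first

ready, _, _ = select.select([r], [], [], 0.1)
print(len(decoded), bool(ready), buffer)
# → 1 False b'{"cmd": "backend_stop"}\n'
```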
Updated by rpalethorpe almost 2 years ago
The os-autoinst PR has been merged, so setting DEBUG_JSON_RPC on the job will log all of the RPCs. It's probably best to do it on a cloned job first.
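A hedged sketch of how that clone could look (the job ID is a placeholder; openqa-clone-job accepts KEY=VALUE pairs to override job settings, though flag details may vary per deployment):

```shell
# Clone an affected job with the new debug setting enabled.
# <JOB_ID> is a placeholder for the job to clone.
clone_cmd="openqa-clone-job --skip-chained-deps \
  https://openqa.suse.de/tests/<JOB_ID> DEBUG_JSON_RPC=1"
echo "$clone_cmd"
```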
Updated by rpalethorpe over 1 year ago
- Assignee set to rpalethorpe
This happened again (see link above).
After enabling DEBUG_JSON_RPC it passed. That is not unexpected, because the bug is random and slowing down the process with debug messages may prevent it from happening.
There are a number of things that could be done:
- Use kernel and uprobe tracing (assuming the latter is practical with Perl)
- Start rewriting the code
- Replace the RPC code with a standard library
The first one requires root access to a worker instance. The latter two are likely to result in more issues.
Updated by MDoucha about 1 year ago
After enabling DEBUG_JSON_RPC on all ltp_controllers jobs, we got some slightly interesting output. Alas, the true cause still remains a mystery.
https://openqa.suse.de/tests/12430237
https://openqa.suse.de/tests/12430237/logfile?filename=autoinst-log.txt
[2023-10-09T14:38:15.261431+02:00] [debug] [pid:114394] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, timeout=30, quiet=undef, output="")
[2023-10-09T14:38:15.261556+02:00] [debug] [pid:114394] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2023-10-09T14:38:15.261682+02:00] [debug] [pid:114394] <<< testapi::wait_serial(regexp="# ", record_output=undef, timeout=90, buffer_size=undef, no_regex=1, quiet=undef, expect_not_found=0)
[2023-10-09T14:38:15.261800+02:00] [debug] [pid:114394] send_json(5) JSON={"buffer_size":null,"cmd":"backend_wait_serial","expect_not_found":0,"json_cmd_token":"LlLeAPyt","no_regex":1,"quiet":null,"record_output":null,"regexp":"# ","timeout":90}
[2023-10-09T14:38:15.261865+02:00] [debug] [pid:114394] read_json(5)
[2023-10-09T20:27:51.091266+02:00] [debug] [pid:114394] autotest received signal TERM, saving results of current test before exiting
Updated by MDoucha 6 months ago
A few weeks ago, the fanotify14 test running on SLE-15SP4 and SLE-15SP5 VMs started triggering the same issue on x86_64 workers:
https://openqa.suse.de/tests/14666205#step/fanotify14/1
https://openqa.suse.de/tests/14666752#step/fanotify14/1