action #121774

LTP cgroup test appears to crash OpenQA worker instance

Added by MDoucha 4 months ago. Updated about 1 month ago.

In Progress
Concrete Bugs
Target version:
Start date:
Due date:
% Done:


Estimated time:


LTP test cgroup_fj_stress_blkio_4_4_each on latest SLE-15SP1 KOTD kernel appears to crash the OpenQA worker instance it's running on. The test itself will succeed but the OpenQA job will stay stuck in wait_serial() for several hours (despite 90 second timeout) until the whole job fails on MAX_JOB_TIME. There are 3 examples so far:

I've seen this issue only on SLE-15SP1 KOTD builds 156 and 157. I have not seen any cases on other SLE versions.

Typical autoinst-log.txt entries related to the timeout:

[2022-12-06T08:52:27.432374+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", output="", quiet=undef, timeout=30, die_on_timeout=1)
[2022-12-06T08:52:27.432549+01:00] [debug] tests/kernel/ called testapi::script_run
[2022-12-06T08:52:27.432710+01:00] [debug] <<< testapi::wait_serial(record_output=undef, regexp="# ", quiet=undef, no_regex=1, buffer_size=undef, expect_not_found=0, timeout=90)
[2022-12-06T10:39:58.278597+01:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-12-06T10:39:58.278622+01:00] [debug] isotovideo received signal TERM
[2022-12-06T10:39:58.278748+01:00] [debug] backend got TERM

Expected result

Last good: 4.12.14-150100.156.1.gb6c27ee (or more recent)

Further details

Always latest result in this scenario: latest

Steps to reproduce:

  1. Run ltp_controllers testsuite on SLE-15SP1 KOTD
  2. Wait.


#1 Updated by okurz 4 months ago

  • Target version set to future

that looks like it's not crashing the worker instance even if maybe it should, right?

Please add more information at best following the openQA bug reporting template that you get when you click the "report bug" icon from openQA. Most important are "link to latest" and "steps to reproduce"

#2 Updated by MDoucha 4 months ago

  • Description updated (diff)

#3 Updated by MDoucha 4 months ago

okurz wrote:

that looks like it's not crashing the worker instance even if maybe it should, right?

wait_serial() clearly does not time out when it's supposed to.

#4 Updated by MDoucha 4 months ago

Or more precisely, the wait_serial() does not return when it should. The last wait_serial() that gets executed is waiting for # prompt and I can see the prompt in serial_terminal.txt. Yet the wait_serial() call ignores it and stays stuck indefinitely.

#5 Updated by MDoucha 3 months ago

Here's a new example on SLE-12SP4 so it's not just 15SP1 anymore:

Once again, the job times out in wait_serial() waiting for shell prompt after the main test process finished. And the shell prompt is right there at the very end of serial_terminal.txt:

#7 Updated by rpalethorpe 3 months ago

looking at a successful run of vmstat

[2022-12-16T14:39:06.501999+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, output="", timeout=30)
[2022-12-16T14:39:06.502157+01:00] [debug] tests/kernel/ called testapi::script_run
[2022-12-16T14:39:06.502380+01:00] [debug] <<< testapi::wait_serial(record_output=undef, quiet=undef, buffer_size=undef, regexp="# ", timeout=90, no_regex=1, expect_not_found=0)
[2022-12-16T14:39:06.503503+01:00] [debug] <<< consoles::serial_screen::read_until(pattern=[
    "# "
  ], timeout=90, regexp="# ", record_output=undef, quiet=undef, expect_not_found=0, no_regex=1, json_cmd_token="NIYqbLbL", buffer_size=undef, cmd="backend_wait_serial")

read_until announces its entry. Now on an unsuccessful run:

[2022-12-16T14:39:11.582074+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, timeout=30, output="")
[2022-12-16T14:39:11.582253+01:00] [debug] tests/kernel/ called testapi::script_run
[2022-12-16T14:39:11.582491+01:00] [debug] <<< testapi::wait_serial(buffer_size=undef, regexp="# ", quiet=undef, record_output=undef, expect_not_found=0, no_regex=1, timeout=90)

here we just enter wait_serial on the front-end. It's not clear if the backend ever receives the command.

#8 Updated by rpalethorpe 3 months ago

Created a PR for printing the PID:
myjsonrpc has a debug option which is set with an env variable: PERL_MYJSONRPC_DEBUG

It is extremely odd that it always happens on the same test, but the test is just moving some processes between CGroups. This is totally internal to the SUT except for the log output it generates, which is quite noisy.

#9 Updated by rpalethorpe about 2 months ago

  • Status changed from New to In Progress

Theoretically the JSON message could be never being read if the select in the backend's main loop never returns the FD for the cmd pipe. I don't see how that is possible in practice though. It would be nice to rule this out, so to that end I have created this PR

#10 Updated by rpalethorpe about 2 months ago

Here is new output now that we have PID in log

[2023-02-02T02:53:41.010973+01:00] [debug] [pid:5845] <<< testapi::script_run(cmd="vmstat -w", quiet=undef, timeout=30, die_on_timeout=1, output="")
[2023-02-02T02:53:41.011182+01:00] [debug] [pid:5845] tests/kernel/ called testapi::script_run
[2023-02-02T02:53:41.011473+01:00] [debug] [pid:5845] <<< testapi::wait_serial(buffer_size=undef, quiet=undef, no_regex=1, expect_not_found=0, timeout=90, record_output=undef, regexp="# ")

Timeout happens here. Testapi sends wait_serial, but we never see read_until being entered.

Below QEMU says it is killed by PID 1047 which is the worker process. This is confirmed by the worker log.

[2023-02-02T08:38:30.254667+01:00] [debug] [pid:5845] autotest received signal TERM, saving results of current test before exiting
[2023-02-02T08:38:30.256697+01:00] [debug] [pid:5644] isotovideo received signal TERM
[2023-02-02T08:38:30.259413+01:00] [debug] [pid:5911] backend got TERM
[2023-02-02T08:38:30.260873+01:00] [info] [pid:5911] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-02-02T08:38:30.268891+01:00] [debug] [pid:5644] commands process exited: 0
[2023-02-02T08:38:30.274473+01:00] [debug] [pid:5644] [autotest] process exited: 1
[2023-02-02T08:38:31.303807+01:00] [debug] [pid:5911] Passing remaining frames to the video encoder
[2023-02-02T08:38:31.304357+01:00] [debug] [pid:5911] Waiting for video encoder to finalize the video
[2023-02-02T08:38:31.304536+01:00] [debug] [pid:5911] The built-in video encoder (pid 5973) terminated
[2023-02-02T08:38:31.307130+01:00] [debug] [pid:5911] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 1047 (/usr/bin/perl)
[2023-02-02T08:38:31.313280+01:00] [debug] [pid:5911] sending magic and exit
[2023-02-02T08:38:31.314229+01:00] [debug] [pid:5644] received magic close
[2023-02-02T08:38:31.316686+01:00] [warn] [pid:5644] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 15 4 3
myjsonrpc: remote end terminated connection, stopping at /usr/lib/os-autoinst/ line 43.
    myjsonrpc::send_json(GLOB(0x55e175a60750), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/ line 117
    OpenQA::Isotovideo::CommandHandler::_send_to_cmd_srv(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/ line 91
    OpenQA::Isotovideo::CommandHandler::stop_command_processing(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/ line 51
    OpenQA::Isotovideo::Runner::run(OpenQA::Isotovideo::Runner=HASH(0x55e172bf6d88)) called at /usr/bin/isotovideo line 143
[2023-02-02T08:38:31.319126+01:00] [debug] [pid:5644] stopping backend process 5911
[2023-02-02T08:38:31.520546+01:00] [debug] [pid:5644] done with backend process
[2023-02-02T08:38:31.520846+01:00] [debug] [pid:5644] stopping autotest process 5845
[2023-02-02T08:38:31.521024+01:00] [debug] [pid:5644] done with autotest process
5644: EXIT 0

There is no timeout set on the do_capture loop and the stall detection is not active with the serial console. So there is nothing to rule out that it is spinning in that loop.

Something else which could cause this, is if the testapi front end sends two messages without waiting for a response. Then the backend could read both draining the socket buffer. However the second one will not be read because it will be sitting in the JSON parser's buffer. This is only checked when the fd has data. It's hard to rule this out.

#11 Updated by rpalethorpe about 1 month ago

The os-autoinst PR has been merged. So setting DEBUG_JSON_RPC on the job will log all of the RPCs. It's probably best to do it on a cloned job first.

Also available in: Atom PDF