action #121774
LTP cgroup test appears to crash OpenQA worker instance

Added by MDoucha over 1 year ago. Updated about 1 month ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2022-12-09
Due date:
% Done:

0%

Estimated time:
Tags:

Description

LTP test cgroup_fj_stress_blkio_4_4_each on the latest SLE-15SP1 KOTD kernel appears to crash the OpenQA worker instance it's running on. The test itself succeeds, but the OpenQA job stays stuck in wait_serial() for several hours (despite the 90-second timeout) until the whole job fails on MAX_JOB_TIME. There are 3 examples so far:
https://openqa.suse.de/tests/10089424#step/cgroup_fj_stress_blkio_4_4_each/7
https://openqa.suse.de/tests/10111009#step/cgroup_fj_stress_blkio_4_4_each/7
https://openqa.suse.de/tests/10113099#step/cgroup_fj_stress_blkio_4_4_each/7

I've seen this issue only on SLE-15SP1 KOTD builds 156 and 157. I have not seen any cases on other SLE versions.

Typical autoinst-log.txt entries related to the timeout:

[2022-12-06T08:52:27.432374+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", output="", quiet=undef, timeout=30, die_on_timeout=1)
[2022-12-06T08:52:27.432549+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-06T08:52:27.432710+01:00] [debug] <<< testapi::wait_serial(record_output=undef, regexp="# ", quiet=undef, no_regex=1, buffer_size=undef, expect_not_found=0, timeout=90)
[2022-12-06T10:39:58.278597+01:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-12-06T10:39:58.278622+01:00] [debug] isotovideo received signal TERM
[2022-12-06T10:39:58.278748+01:00] [debug] backend got TERM

Expected result

Last good: 4.12.14-150100.156.1.gb6c27ee (or more recent)

Further details

Always latest result in this scenario: latest

Steps to reproduce:

  1. Run ltp_controllers testsuite on SLE-15SP1 KOTD
  2. Wait.
Actions #1

Updated by okurz over 1 year ago

  • Target version set to future

that looks like it's not crashing the worker instance even if maybe it should, right?

Please add more information, ideally following the openQA bug reporting template that you get when you click the "report bug" icon in openQA. Most important are "link to latest" and "steps to reproduce".

Actions #2

Updated by MDoucha over 1 year ago

  • Description updated (diff)
Actions #3

Updated by MDoucha over 1 year ago

okurz wrote:

that looks like it's not crashing the worker instance even if maybe it should, right?

wait_serial() clearly does not time out when it's supposed to.

Actions #4

Updated by MDoucha over 1 year ago

Or more precisely, the wait_serial() does not return when it should. The last wait_serial() that gets executed is waiting for # prompt and I can see the prompt in serial_terminal.txt. Yet the wait_serial() call ignores it and stays stuck indefinitely.
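For reference, a read-until-pattern loop with a deadline is usually shaped like the following minimal Python sketch (illustrative only, not os-autoinst's actual consoles::serial_screen::read_until, which is Perl). Once such a loop is entered, the deadline is checked on every iteration, so an indefinite hang would have to mean the loop is never entered at all:

```python
import select, time

def read_until(sock, pattern: bytes, timeout: float) -> bytes:
    # Illustrative sketch only -- not os-autoinst's actual
    # consoles::serial_screen::read_until implementation.
    deadline = time.monotonic() + timeout
    buf = b""
    while pattern not in buf:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"no match for {pattern!r} within {timeout}s")
        # Wait for serial data, but never past the deadline.
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            raise TimeoutError(f"no match for {pattern!r} within {timeout}s")
        chunk = sock.recv(4096)
        if not chunk:
            raise EOFError("serial connection closed")
        buf += chunk
    return buf
```

With this shape, a prompt that is visibly present in serial_terminal.txt but "ignored" by wait_serial() points at the command never reaching this loop rather than at the pattern matching itself.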

Actions #5

Updated by MDoucha over 1 year ago

Here's a new example on SLE-12SP4 so it's not just 15SP1 anymore:
https://openqa.suse.de/tests/10189630#step/cgroup_fj_stress_blkio_4_4_each/8

Once again, the job times out in wait_serial() waiting for shell prompt after the main test process finished. And the shell prompt is right there at the very end of serial_terminal.txt:
https://openqa.suse.de/tests/10189630/logfile?filename=serial_terminal.txt

Actions #7

Updated by rpalethorpe over 1 year ago

Looking at a successful run of vmstat:

[2022-12-16T14:39:06.501999+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, output="", timeout=30)
[2022-12-16T14:39:06.502157+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:06.502380+01:00] [debug] <<< testapi::wait_serial(record_output=undef, quiet=undef, buffer_size=undef, regexp="# ", timeout=90, no_regex=1, expect_not_found=0)
[2022-12-16T14:39:06.503503+01:00] [debug] <<< consoles::serial_screen::read_until(pattern=[
    "# "
  ], timeout=90, regexp="# ", record_output=undef, quiet=undef, expect_not_found=0, no_regex=1, json_cmd_token="NIYqbLbL", buffer_size=undef, cmd="backend_wait_serial")

read_until announces its entry. Now on an unsuccessful run:

[2022-12-16T14:39:11.582074+01:00] [debug] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, quiet=undef, timeout=30, output="")
[2022-12-16T14:39:11.582253+01:00] [debug] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2022-12-16T14:39:11.582491+01:00] [debug] <<< testapi::wait_serial(buffer_size=undef, regexp="# ", quiet=undef, record_output=undef, expect_not_found=0, no_regex=1, timeout=90)

Here we just enter wait_serial on the front-end. It's not clear whether the backend ever receives the command.
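The command flow here is a blocking request/response over a socket: testapi serializes the command with send_json and then blocks in read_json until the backend answers. A rough Python sketch of that shape (an assumption about the overall pattern; myjsonrpc itself is Perl and its framing differs in detail) shows why a command that is never received, or never answered, hangs the front-end indefinitely:

```python
import json, socket

def send_json(sock: socket.socket, msg: dict) -> None:
    # Serialize the command; a newline delimits messages in this sketch.
    sock.sendall(json.dumps(msg).encode() + b"\n")

def read_json(sock: socket.socket) -> dict:
    # Block until one full message arrives. Note there is no timeout at
    # this layer: if the peer never answers, the caller hangs forever.
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise EOFError("remote end terminated connection")
        buf += chunk
    return json.loads(buf)
```

The 90-second timeout belongs to the backend's serial reading, not to this RPC layer, so if backend_wait_serial is lost in transit, nothing on the testapi side ever fires.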

Actions #8

Updated by rpalethorpe over 1 year ago

Created a PR for printing the PID: https://github.com/os-autoinst/os-autoinst/pull/2237
myjsonrpc has a debug option which is set with an env variable: PERL_MYJSONRPC_DEBUG

It is extremely odd that it always happens on the same test, but the test is just moving some processes between CGroups. This is totally internal to the SUT except for the log output it generates, which is quite noisy.

Actions #9

Updated by rpalethorpe over 1 year ago

  • Status changed from New to In Progress

Theoretically, the JSON message could never be read if the select in the backend's main loop never returns the FD for the cmd pipe. I don't see how that is possible in practice, though. It would be nice to rule this out, so to that end I have created this PR: https://github.com/os-autoinst/os-autoinst/pull/2259
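The hypothesis being ruled out is that select() never reports the command pipe as readable, so its handler never runs. The shape of such a select-driven dispatch step, as a minimal Python sketch (a hypothetical helper, not the actual os-autoinst main loop):

```python
import select

def dispatch_ready(fds, handlers, timeout=None):
    """One iteration of a select-driven main loop: wait for readable
    fds and invoke the matching handler for each.

    If select() somehow never returned the cmd-pipe fd, its handler --
    and with it backend_wait_serial -- would never run, which would
    match the observed symptom.
    """
    readable, _, _ = select.select(fds, [], [], timeout)
    for fd in readable:
        handlers[fd](fd)
    return readable
```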

Actions #10

Updated by rpalethorpe over 1 year ago

Here is new output, now that we have the PID in the log:

[2023-02-02T02:53:41.010973+01:00] [debug] [pid:5845] <<< testapi::script_run(cmd="vmstat -w", quiet=undef, timeout=30, die_on_timeout=1, output="")
[2023-02-02T02:53:41.011182+01:00] [debug] [pid:5845] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2023-02-02T02:53:41.011473+01:00] [debug] [pid:5845] <<< testapi::wait_serial(buffer_size=undef, quiet=undef, no_regex=1, expect_not_found=0, timeout=90, record_output=undef, regexp="# ")

The timeout happens here: testapi sends wait_serial, but we never see read_until being entered.

Below, QEMU says it was killed by PID 1047, which is the worker process; this is confirmed by the worker log.

[2023-02-02T08:38:30.254667+01:00] [debug] [pid:5845] autotest received signal TERM, saving results of current test before exiting
[2023-02-02T08:38:30.256697+01:00] [debug] [pid:5644] isotovideo received signal TERM
[2023-02-02T08:38:30.259413+01:00] [debug] [pid:5911] backend got TERM
[2023-02-02T08:38:30.260873+01:00] [info] [pid:5911] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2023-02-02T08:38:30.268891+01:00] [debug] [pid:5644] commands process exited: 0
[2023-02-02T08:38:30.274473+01:00] [debug] [pid:5644] [autotest] process exited: 1
[2023-02-02T08:38:31.303807+01:00] [debug] [pid:5911] Passing remaining frames to the video encoder
[2023-02-02T08:38:31.304357+01:00] [debug] [pid:5911] Waiting for video encoder to finalize the video
[2023-02-02T08:38:31.304536+01:00] [debug] [pid:5911] The built-in video encoder (pid 5973) terminated
[2023-02-02T08:38:31.307130+01:00] [debug] [pid:5911] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 1047 (/usr/bin/perl)
[2023-02-02T08:38:31.313280+01:00] [debug] [pid:5911] sending magic and exit
[2023-02-02T08:38:31.314229+01:00] [debug] [pid:5644] received magic close
[2023-02-02T08:38:31.316686+01:00] [warn] [pid:5644] !!! OpenQA::Isotovideo::Runner::_read_response: THERE IS NOTHING TO READ 15 4 3
myjsonrpc: remote end terminated connection, stopping at /usr/lib/os-autoinst/myjsonrpc.pm line 43.
    myjsonrpc::send_json(GLOB(0x55e175a60750), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 117
    OpenQA::Isotovideo::CommandHandler::_send_to_cmd_srv(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8), HASH(0x55e17a153c78)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/CommandHandler.pm line 91
    OpenQA::Isotovideo::CommandHandler::stop_command_processing(OpenQA::Isotovideo::CommandHandler=HASH(0x55e1792c9df8)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 51
    OpenQA::Isotovideo::Runner::run(OpenQA::Isotovideo::Runner=HASH(0x55e172bf6d88)) called at /usr/bin/isotovideo line 143
[2023-02-02T08:38:31.319126+01:00] [debug] [pid:5644] stopping backend process 5911
[2023-02-02T08:38:31.520546+01:00] [debug] [pid:5644] done with backend process
[2023-02-02T08:38:31.520846+01:00] [debug] [pid:5644] stopping autotest process 5845
[2023-02-02T08:38:31.521024+01:00] [debug] [pid:5644] done with autotest process
5644: EXIT 0

There is no timeout set on the do_capture loop, and stall detection is not active with the serial console. So nothing rules out that it is spinning in that loop.

Something else which could cause this is if the testapi front end sends two messages without waiting for a response. The backend could then read both in one go, draining the socket buffer. The second message would never be processed, however, because it would be sitting in the JSON parser's buffer, which is only checked when the fd has data. This is hard to rule out.
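The second scenario can be demonstrated with a small Python sketch (a toy newline-delimited reader, not myjsonrpc itself): if two commands arrive back to back, one recv() can drain both from the socket, the second then exists only in the reader's internal buffer, and a main loop that reads solely when select() flags the fd readable never dispatches it.

```python
import json, select, socket

class BufferedJsonReader:
    """Toy newline-delimited JSON reader with an internal byte buffer.

    Hypothetical sketch of the pitfall described above, not the actual
    myjsonrpc code.
    """

    def __init__(self, sock: socket.socket):
        self.sock = sock
        self.buf = b""

    def read_message(self) -> dict:
        # recv() may drain several messages from the socket at once;
        # everything beyond the first line stays in self.buf.
        while b"\n" not in self.buf:
            self.buf += self.sock.recv(4096)
        line, _, self.buf = self.buf.partition(b"\n")
        return json.loads(line)

    def has_buffered_message(self) -> bool:
        return b"\n" in self.buf
```

After one read_message() call, the second command is complete in the internal buffer, yet select() on the now-empty socket reports nothing readable, so a select-driven loop would sit in select() forever with the command already in hand.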

Actions #11

Updated by rpalethorpe over 1 year ago

The os-autoinst PR has been merged, so setting DEBUG_JSON_RPC on the job will log all of the RPCs. It's probably best to try it on a cloned job first.

Actions #13

Updated by rpalethorpe 10 months ago

  • Assignee set to rpalethorpe

This happened again (see link above).

After enabling DEBUG_JSON_RPC it passed. That is not unexpected: the bug is random, and slowing the process down with debug messages may prevent it from happening.

There are a number of things that could be done:

  1. Use kernel and uprobe tracing (assuming the latter is practical with Perl)
  2. Start rewriting the code
  3. Replace the RPC code with a standard library

The first one requires root access to a worker instance. The latter two are likely to result in more issues.

Actions #14

Updated by MDoucha 9 months ago

After enabling DEBUG_JSON_RPC on all ltp_controllers jobs, we got some slightly interesting output. Alas, the true cause still remains a mystery.
https://openqa.suse.de/tests/12430237
https://openqa.suse.de/tests/12430237/logfile?filename=autoinst-log.txt

[2023-10-09T14:38:15.261431+02:00] [debug] [pid:114394] <<< testapi::script_run(cmd="vmstat -w", die_on_timeout=1, timeout=30, quiet=undef, output="")
[2023-10-09T14:38:15.261556+02:00] [debug] [pid:114394] tests/kernel/run_ltp.pm:334 called testapi::script_run
[2023-10-09T14:38:15.261682+02:00] [debug] [pid:114394] <<< testapi::wait_serial(regexp="# ", record_output=undef, timeout=90, buffer_size=undef, no_regex=1, quiet=undef, expect_not_found=0)
[2023-10-09T14:38:15.261800+02:00] [debug] [pid:114394] send_json(5) JSON={"buffer_size":null,"cmd":"backend_wait_serial","expect_not_found":0,"json_cmd_token":"LlLeAPyt","no_regex":1,"quiet":null,"record_output":null,"regexp":"# ","timeout":90}
[2023-10-09T14:38:15.261865+02:00] [debug] [pid:114394] read_json(5)
[2023-10-09T20:27:51.091266+02:00] [debug] [pid:114394] autotest received signal TERM, saving results of current test before exiting
Actions #15

Updated by MDoucha about 1 month ago

A few weeks ago, the fanotify14 test running on SLE-15SP4 and SLE-15SP5 VMs started triggering the same issue on x86_64 workers:
https://openqa.suse.de/tests/14666205#step/fanotify14/1
https://openqa.suse.de/tests/14666752#step/fanotify14/1
