action #57662

closed

monitoring considered harmful

Added by coolo about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2019-10-03
Due date:
% Done: 0%
Estimated time:

Description

Since osd runs a lot of LTP tests for kernel live patches, we noticed a dramatic slowdown in uploading results. Stracing the Mojolicious workers showed that they were doing a lot of System V IPC, and it was unclear why.
Attaching gdb to one of them showed that this is all triggered by IPC::ShareLite, which is used by Mojolicious::Plugin::Status. I removed both and disabled the monitoring option in openqa.ini.

Before:

Oct 03 07:16:14 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-40.txt
Oct 03 07:16:17 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-41.txt
Oct 03 07:16:19 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-42.txt
Oct 03 07:16:28 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-43.txt
Oct 03 07:16:34 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-44.txt
Oct 03 07:16:37 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-45.txt
Oct 03 07:16:41 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-46.txt
Oct 03 07:16:41 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-47.txt
Oct 03 07:16:43 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-48.txt
Oct 03 07:16:50 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-49.txt

After:

Oct 03 07:21:17 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-41.txt
Oct 03 07:21:18 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-42.txt
Oct 03 07:21:18 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-43.txt
Oct 03 07:21:18 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-44.txt
Oct 03 07:21:19 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-45.txt
Oct 03 07:21:19 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-46.txt
Oct 03 07:21:19 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-47.txt
Oct 03 07:21:20 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-48.txt
Oct 03 07:21:20 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-49.txt
Oct 03 07:21:20 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-50.txt
Oct 03 07:21:21 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-51.txt

I guess either this monitor plugin can be implemented without locking or we need to remove the option.
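
To put a rough number on the two log excerpts above, the mean gap between consecutive "Uploading artefact" lines can be computed from the timestamps. The following is only a sketch: the timestamps are copied from the excerpts, the helper name mean_gap is made up for illustration, and the two excerpts come from different test modules (boot_ltp vs. shutdown_ltp), so artefact sizes may differ and the ratio is an indication rather than a measurement.

```python
from datetime import datetime

# Timestamps copied from the "Before" and "After" log excerpts.
# Assumption: all lines of one excerpt fall on the same day.
BEFORE = ["07:16:14", "07:16:17", "07:16:19", "07:16:28", "07:16:34",
          "07:16:37", "07:16:41", "07:16:41", "07:16:43", "07:16:50"]
AFTER = ["07:21:17", "07:21:18", "07:21:18", "07:21:18", "07:21:19",
         "07:21:19", "07:21:19", "07:21:20", "07:21:20", "07:21:20",
         "07:21:21"]


def mean_gap(stamps):
    """Return the mean number of seconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    print(f"before: {mean_gap(BEFORE):.1f} s per upload")
    print(f"after:  {mean_gap(AFTER):.1f} s per upload")
```

With these timestamps the excerpts work out to about 4.0 s per upload before (36 s over 9 gaps) and about 0.4 s after (4 s over 10 gaps). The log resolution is one second, so the "After" figure is coarse.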

Actions #1

Updated by coolo about 5 years ago

I even had to reduce the number of Mojolicious workers to 20: now that we utilize our full potential, we were suddenly CPU bound. It's hard to swallow that we were driving with the handbrake on all this time.

Actions #2

Updated by okurz about 5 years ago

That also explains why we never saw this problem on o3, where I just suspected it was due to the lower load.

Actions #3

Updated by mkittler about 5 years ago

Good that you've found out. I'm curious how you used gdb. Did you "simply" attach to the process as if it were a C/C++ program with debug symbols? That usually doesn't provide a lot of information. Did you deduce the name of the relevant Perl module only from the C call stack? I'm also not sure how to actually use the perl-debuginfo package.

Actions #4

Updated by coolo about 5 years ago

IPC::ShareLite is implemented in C, so there wasn't much guesswork required.

Actions #5

Updated by mkittler about 5 years ago

Ah, that makes things simpler.

After the discussion in the chat I suppose the best solution is to simply keep the plugin disabled for now. So can the issue be closed again?

I've also created a PR to add a note about the harmfulness of the plugin: https://github.com/os-autoinst/openQA/pull/2377

Actions #6

Updated by livdywan about 5 years ago

mkittler wrote:

After the discussion in the chat I suppose the best solution is to simply keep the plugin disabled for now. So can the issue be closed again?

I've also created a PR to add a note about the harmfulness of the plugin: https://github.com/os-autoinst/openQA/pull/2377

I'm not sure with regard to the goal of this ticket - should the plugin fundamentally be considered expensive, or is it worth trying to optimize it?

Actions #7

Updated by coolo about 5 years ago

The /monitoring route is considered nice to have - so if the monitoring is hard to get, ditch it.

Actions #8

Updated by okurz about 5 years ago

  • Status changed from New to Resolved
  • Assignee set to mkittler
Actions #9

Updated by coolo about 5 years ago

  • Target version changed from Ready to Done