action #57662

monitoring considered harmful

Added by coolo 5 months ago. Updated 5 months ago.

Status: Resolved
Start date: 03/10/2019
Priority: Normal
Due date:
Assignee: mkittler
% Done: 0%
Category: Concrete Bugs
Target version: Done
Difficulty:
Duration:

Description

Since osd runs a lot of LTP tests for kernel live patches, we noticed a dramatic slowdown in uploading results. Stracing the Mojolicious workers showed that they were doing a lot of System V IPC, and it was unclear why.
Attaching gdb to one of them revealed that this is all triggered by IPC::ShareLite, which is used by Mojolicious::Plugin::Status. I removed both and disabled the monitoring option in openqa.ini.

Before:

Oct 03 07:16:14 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-40.txt
Oct 03 07:16:17 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-41.txt
Oct 03 07:16:19 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-42.txt
Oct 03 07:16:28 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-43.txt
Oct 03 07:16:34 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-44.txt
Oct 03 07:16:37 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-45.txt
Oct 03 07:16:41 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-46.txt
Oct 03 07:16:41 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-47.txt
Oct 03 07:16:43 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-48.txt
Oct 03 07:16:50 openqaworker8 worker[18535]: [debug] [pid:23592] Uploading artefact boot_ltp-49.txt

After:

Oct 03 07:21:17 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-41.txt
Oct 03 07:21:18 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-42.txt
Oct 03 07:21:18 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-43.txt
Oct 03 07:21:18 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-44.txt
Oct 03 07:21:19 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-45.txt
Oct 03 07:21:19 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-46.txt
Oct 03 07:21:19 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-47.txt
Oct 03 07:21:20 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-48.txt
Oct 03 07:21:20 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-49.txt
Oct 03 07:21:20 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-50.txt
Oct 03 07:21:21 openqaworker8 worker[18535]: [debug] [pid:25212] Uploading artefact shutdown_ltp-51.txt
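
To put a rough number on the difference, the timestamps of the two excerpts can be compared. The following is only a sketch: it uses the times copied from the log lines above, the function name `mean_gap_seconds` is made up for illustration, and the logs only have one-second resolution, so the result is an approximation.

```python
from datetime import datetime

# Timestamps copied from the two log excerpts above (time of day only).
BEFORE = ["07:16:14", "07:16:17", "07:16:19", "07:16:28", "07:16:34",
          "07:16:37", "07:16:41", "07:16:41", "07:16:43", "07:16:50"]
AFTER = ["07:21:17", "07:21:18", "07:21:18", "07:21:18", "07:21:19",
         "07:21:19", "07:21:19", "07:21:20", "07:21:20", "07:21:20",
         "07:21:21"]


def mean_gap_seconds(stamps):
    """Average time between consecutive upload log lines, in seconds."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    total = (times[-1] - times[0]).total_seconds()
    return total / (len(times) - 1)


before = mean_gap_seconds(BEFORE)
after = mean_gap_seconds(AFTER)
print(f"before: {before:.1f} s per artefact")  # 36 s over 9 gaps
print(f"after:  {after:.1f} s per artefact")   # 4 s over 10 gaps
```

For these two short samples that works out to about 4.0 s per artefact before and about 0.4 s after, i.e. roughly a factor of ten.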

I guess either this monitor plugin can be implemented without locking or we need to remove the option.
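
As an illustration of what "without locking" could mean in general, here is a minimal sketch. It is not how Mojolicious::Plugin::Status works internally: threads stand in for the prefork worker processes, a `threading.Lock` stands in for the shared System V state, and all names are hypothetical. The first variant takes one shared lock per request, the second lets every worker count privately and only combines the numbers when they are read.

```python
import threading


def count_with_shared_lock(workers, requests_per_worker):
    """Every request updates one shared counter under a global lock."""
    lock = threading.Lock()
    stats = {"requests": 0}

    def work():
        for _ in range(requests_per_worker):
            with lock:  # all workers queue up here on every request
                stats["requests"] += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["requests"]


def count_per_worker(workers, requests_per_worker):
    """Each worker counts privately; totals are merged only when read."""
    slots = [0] * workers  # one slot per worker, written only by its owner

    def work(index):
        local = 0
        for _ in range(requests_per_worker):
            local += 1  # no lock on the request path
        slots[index] = local

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(slots)


print(count_with_shared_lock(4, 1000))
print(count_per_worker(4, 1000))
```

Both variants report the same total. The difference is that the second one never makes workers wait for each other while handling a request, at the price of statistics that are only as fresh as the last merge.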

History

#1 Updated by coolo 5 months ago

I even had to reduce the number of Mojolicious workers to 20: now that we utilize our full potential, we were suddenly CPU bound. It's hard to swallow that we were driving with the handbrake on all this time.

#2 Updated by okurz 5 months ago

That also explains why we never saw this problem on o3, where I had just suspected that the lower load was the reason.

#3 Updated by mkittler 5 months ago

Good that you've found out. I'm curious how you utilized gdb? "Simply" attached to the process like it was a C/C++ program with debug symbols? But that usually doesn't provide a lot of information. Did you deduce the name of the relevant Perl module only from the C call stack? I'm also not sure how to actually use the perl-debuginfo package.

#4 Updated by coolo 5 months ago

IPC::ShareLite is implemented in C, so there wasn't much guess work required.

#5 Updated by mkittler 5 months ago

Ah, that makes things simpler.

After the discussion in the chat I suppose the best solution is to simply keep the plugin disabled for now. So can the issue be closed again?

I've also created a PR to add a note about the harmfulness of the plugin: https://github.com/os-autoinst/openQA/pull/2377

#6 Updated by cdywan 5 months ago

mkittler wrote:

After the discussion in the chat I suppose the best solution is to simply keep the plugin disabled for now. So can the issue be closed again?


I've also created a PR to add a note about the harmfulness of the plugin: https://github.com/os-autoinst/openQA/pull/2377

I'm not sure with regard to the goal of this ticket - should the plugin fundamentally be considered expensive, or is it worth trying to optimize it?

#7 Updated by coolo 5 months ago

The /monitoring route is considered nice to have - so if the monitoring is hard to get, ditch it.

#8 Updated by okurz 5 months ago

  • Status changed from New to Resolved
  • Assignee set to mkittler

#9 Updated by coolo 5 months ago

  • Target version changed from Ready to Done
