action #96795

closed

CPU Load alert and telegraf going between 41%, 98.5% and 115% CPU

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 2021-08-12
Due date: 2021-08-28
% Done: 0%
Estimated time:

Description

Observation

  • CPU Load alert triggered
  • htop shows that telegraf is going up and down
  • telegraf still spiking regardless of the alert being OK

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #96807: Web UI is slow and Apache Response Time alert got triggered (Status: Resolved, Assignee: okurz, Start date: 2021-08-12, Due date: 2021-10-01)

Actions #1

Updated by livdywan over 2 years ago

I saw the CPU alert trigger in the meantime, but I couldn't confirm a correlation with Telegraf's spikes, which just seem to continue.

Switched it off temporarily for testing via sudo systemctl disable --now telegraf, and I now see openqa processes maxing out at 54% as the worst offenders. After switching it back on via enable, the spikes are fully back.

Actions #2

Updated by livdywan over 2 years ago

Also checked the /debug/vars route, although I couldn't see anything useful there. I also reverted the logwarn support, which wasn't correctly configured, but that didn't affect CPU usage either.

Actions #3

Updated by livdywan over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to livdywan

In the weekly it was suggested we could increase the interval, which is currently 10s on the web UI and 1m on workers. We also have 0s jitter; upstream docs recommend setting a jitter to spread out plugin collection for better performance. MR here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/548
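
For illustration, a change along these lines might look like the following in the [agent] section of telegraf.conf. This is only a sketch: the values 30s and 5s are assumptions picked for the example and are not taken from the MR. interval and collection_jitter are the upstream agent settings. The snippet stages the fragment in a scratch directory so it can be reviewed before anything on the host is touched.

```shell
set -eu

# Scratch directory for review; nothing under /etc/telegraf is modified here.
staging="$(mktemp -d)"

# Illustrative values only (assumptions, not the values from the MR).
cat > "$staging/agent-fragment.conf" <<'EOF'
[agent]
  ## Collect less often than the current 10s on the web UI host.
  interval = "30s"
  ## Offset each plugin's collection by a random amount up to this value,
  ## so that inputs do not all fire at the same moment.
  collection_jitter = "5s"
EOF

# Print the staged fragment for review.
cat "$staging/agent-fragment.conf"
```

A longer interval lowers how often every input runs, while a non-zero jitter only spreads the same work over time, so the two settings address the average load and the spikes respectively.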

Actions #4

Updated by openqa_review over 2 years ago

  • Due date set to 2021-08-28

Setting due date based on mean cycle time of SUSE QE Tools

Actions #5

Updated by livdywan over 2 years ago

Another suggestion from the extended daily was to adjust the nice level via salt (maybe via Exec or salt states if possible), so I'll prepare an MR for that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/551
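
As a rough sketch of what the nice-level change amounts to on a single host, assuming it is done through a systemd drop-in for telegraf.service (the MR may well do this differently through salt). Nice= is the standard systemd service setting; the value 10 matches the level reported later in this ticket, and the file name nice.conf is an assumption for the example. The file is staged in a scratch directory, with the install and verification commands left as comments because they need root.

```shell
set -eu

# Scratch directory for review; nothing under /etc/systemd is modified here.
staging="$(mktemp -d)"

# Drop-in that lowers telegraf's scheduling priority (hypothetical file name).
cat > "$staging/nice.conf" <<'EOF'
[Service]
Nice=10
EOF

# Print the staged drop-in for review.
cat "$staging/nice.conf"

# To apply on a host (as root), something along these lines:
#   install -D -m 0644 "$staging/nice.conf" /etc/systemd/system/telegraf.service.d/nice.conf
#   systemctl daemon-reload && systemctl restart telegraf
#   systemctl show -p Nice telegraf
```

A higher nice value does not make telegraf do less work; it only lets other processes, such as the openqa ones, win when the CPU is contended.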

Actions #6

Updated by okurz over 2 years ago

Merged. On osd I ran systemctl daemon-reload && systemctl restart telegraf. I can confirm that telegraf runs with nice level 10 now. What's next?

Actions #7

Updated by livdywan over 2 years ago

  • Status changed from In Progress to Feedback

okurz wrote:

Merged. On osd I ran systemctl daemon-reload && systemctl restart telegraf. I can confirm that telegraf runs with nice level 10 now. What's next?

It looks like telegraf is on average using a lot less CPU than before, so I'm inclined to consider this a success.

Actions #8

Updated by livdywan over 2 years ago

  • Related to action #96807: Web UI is slow and Apache Response Time alert got triggered added

Actions #9

Updated by livdywan over 2 years ago

  • Status changed from Feedback to Resolved

cdywan wrote:

It looks like telegraf is on average using a lot less CPU than before, so I'm inclined to consider this a success.

Hence resolving.
