action #96795: CPU Load alert and telegraf going between 41%, 98.5% and 115% CPU - openQA Infrastructure (public) - openSUSE Project Management Tool

action #96795

closed

CPU Load alert and telegraf going between 41%, 98.5% and 115% CPU

Added by livdywan almost 4 years ago. Updated almost 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

livdywan

Category:

Target version:

openQA Project (public) - Ready

Start date:

2021-08-12

Due date:

2021-08-28

% Done:

Estimated time:

Description

Observation

CPU Load alert triggered
htop shows telegraf's CPU usage going up and down
telegraf keeps spiking even while the alert is OK

Related issues 1 (0 open — 1 closed)

Updated by livdywan almost 4 years ago

I saw the CPU alert trigger in the meantime, but I couldn't confirm a correlation with Telegraf's spikes, which just seem to continue.

I switched it off temporarily for testing via sudo systemctl disable --now telegraf, and the worst offenders are now openqa processes maxing out at 54%. After switching it back on via enable, the spikes are fully back.

Updated by livdywan almost 4 years ago

I also checked the /debug/vars route, but couldn't see anything useful there. I reverted the logwarn support as well, since it wasn't correctly configured, but that didn't affect CPU usage either.

Updated by livdywan almost 4 years ago

Status changed from New to In Progress
Assignee set to livdywan

In the weekly it was suggested that we could increase the interval, which is currently 10s on the web UI and 1m on workers. We also have 0s jitter, although upstream docs recommend using jitter to spread out plugins for better performance. MR here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/548
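
To illustrate why both suggestions could help, here is a minimal sketch in Python. It is a simplified model and not Telegraf's actual scheduler: it assumes every plugin is triggered once per interval and, when jitter is set, waits a random delay below the jitter value before collecting (roughly what an agent setting such as collection_jitter is documented to do). The plugin count of 20 and the 5s jitter are made-up values for illustration, not taken from the MR.

```python
import random
from collections import Counter


def collection_times(num_plugins, interval_s, jitter_s, cycles, seed=0):
    """Return the start time (in seconds) of every simulated collection run.

    Assumed model: each plugin is triggered every `interval_s` seconds and
    then waits a random delay of up to `jitter_s` seconds before collecting.
    """
    rng = random.Random(seed)
    times = []
    for cycle in range(cycles):
        tick = cycle * interval_s
        for _ in range(num_plugins):
            delay = rng.uniform(0, jitter_s) if jitter_s > 0 else 0.0
            times.append(tick + delay)
    return times


def peak_concurrent_starts(times, bucket_s=1.0):
    """Largest number of collections that start within the same time bucket."""
    buckets = Counter(int(t // bucket_s) for t in times)
    return max(buckets.values())


def runs_per_hour(interval_s):
    """How often a single plugin collects per hour at a given interval."""
    return 3600 // interval_s


# Hypothetical numbers: 20 plugins, 10s interval, six cycles.
no_jitter = collection_times(num_plugins=20, interval_s=10, jitter_s=0, cycles=6)
with_jitter = collection_times(num_plugins=20, interval_s=10, jitter_s=5, cycles=6)

print("peak starts per second, 0s jitter:", peak_concurrent_starts(no_jitter))
print("peak starts per second, 5s jitter:", peak_concurrent_starts(with_jitter))
print("runs per hour at 10s:", runs_per_hour(10), "at 1m:", runs_per_hour(60))
```

With 0s jitter all plugins start in the same second of every cycle, which would match short CPU bursts; with jitter the starts are spread over several seconds. A longer interval reduces how often those bursts occur at all.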

Updated by openqa_review almost 4 years ago

Due date set to 2021-08-28

Setting due date based on mean cycle time of SUSE QE Tools

Updated by livdywan almost 4 years ago

Another suggestion from the extended daily was to adjust the nice level via salt (maybe via Exec or salt states if possible), so I'll prepare an MR for that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/551
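
As a rough illustration of what raising the nice level means, the following Python sketch starts a child process with a higher niceness than its parent. This is not how the MR implements it; in the real setup the value would more likely be set through Salt or the service unit (for example systemd's Nice= directive, which is an assumption here). The helper name run_with_niceness, the increment of 10 and the stand-in command are hypothetical. The sketch is POSIX-only.

```python
import os
import subprocess
import sys


def run_with_niceness(cmd, increment):
    """Run `cmd` as a child process whose niceness is raised by `increment`.

    Only the child is affected; the calling process keeps its own priority.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # executed in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for the real service: a command that just reports its own niceness.
report = [sys.executable, "-c", "import os; print(os.getpriority(os.PRIO_PROCESS, 0))"]

parent_nice = os.getpriority(os.PRIO_PROCESS, 0)
child_nice = int(run_with_niceness(report, 10).stdout)
print(f"parent niceness: {parent_nice}, child niceness: {child_nice}")
```

A higher niceness does not reduce the work the process does. It only makes the scheduler favor other processes when the CPU is contended, so spikes may still show up in htop while hurting other services less.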
