action #96795
closed
CPU Load alert and telegraf going between 41%, 98.5% and 115% CPU
Description
Observation
- CPU Load alert triggered
- htop shows that telegraf is going up and down
- telegraf still spiking regardless of the alert being OK
Updated by livdywan over 3 years ago
I saw the CPU alert trigger in the meantime, but I couldn't confirm a correlation with Telegraf's spikes which just seem to continue.
I switched it off temporarily for testing via sudo systemctl disable --now telegraf and now see openqa processes maxing out at 54% as the worst offenders. After switching it back on via enable, the spikes are fully back.
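The on/off comparison above can be made a bit more repeatable by summing CPU per command name instead of reading htop by eye. This is a minimal sketch, not what was run on the host: sum_cpu_for is a hypothetical helper, and ps reports %CPU averaged over each process's lifetime, so short spikes are still better watched with a sampling tool such as htop.

```shell
# Hypothetical helper: sum the %CPU column for one command name.
# Reads "COMMAND %CPU" lines on stdin, as printed by `ps -eo comm=,pcpu=`.
# The last field is taken as %CPU so the sum is not thrown off by extra columns.
sum_cpu_for() {
    awk -v name="$1" '$1 == name { total += $NF } END { printf "%.1f\n", total }'
}

# On a live host (assumption: procps ps is available):
#   ps -eo comm=,pcpu= | sum_cpu_for telegraf
# Here the function is fed canned sample lines so it runs anywhere.
printf 'telegraf 41.0\nopenqa 54.0\ntelegraf 2.5\n' | sum_cpu_for telegraf
```

Running it once with telegraf disabled and once with it enabled gives two numbers that are easier to compare than a moving htop display.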
Updated by livdywan over 3 years ago
I also checked the /debug/vars route, although I couldn't see anything useful there. I also reverted the logwarn support, which wasn't correctly configured, but that didn't affect CPU usage either.
Updated by livdywan over 3 years ago
- Status changed from New to In Progress
- Assignee set to livdywan
In the weekly it was suggested we could increase the interval, which is currently 10s on the web UI and 1m on workers. We also have 0s jitter; upstream docs recommend using jitter to spread out plugins for better performance. MR here: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/548
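For reference, interval and jitter are set in the [agent] table of the Telegraf configuration. The fragment below is only a sketch of the kind of change discussed: the values are illustrative assumptions, not the ones from the MR, it assumes the jitter in question is collection_jitter, and it writes to a scratch file rather than the real config.

```shell
# Write an illustrative [agent] fragment to a scratch file.
# The values are assumptions for illustration, not those from the MR.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
[agent]
  ## How often inputs are collected (the ticket mentions 10s on the web UI).
  interval = "30s"
  ## Each input sleeps a random time up to this value before collecting,
  ## which spreads plugins out instead of firing them all at once.
  collection_jitter = "5s"
EOF

# Show the two settings that would change.
grep -E '^ *(interval|collection_jitter) *=' "$conf"
```

A longer interval lowers the average load, while a non-zero jitter mainly flattens the peaks, so the two settings address different parts of the spiking seen in htop.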
Updated by openqa_review over 3 years ago
- Due date set to 2021-08-28
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 3 years ago
Another suggestion from the extended daily was to adjust the nice level via salt (maybe via Exec or salt states if possible), so I'll prepare an MR for that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/551
Updated by okurz over 3 years ago
merged. On osd I ran systemctl daemon-reload && systemctl restart telegraf. I can confirm that telegraf runs with nice-level 10 now. What's next?
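One common way to get this result is a systemd drop-in that sets Nice= for the service; whether the MR did it exactly this way is an assumption here. The sketch below writes such a drop-in to a scratch directory (write_nice_dropin is a hypothetical helper) and lists, as comments, commands that could be used on the host to check the effective value.

```shell
# Hypothetical helper: write a systemd drop-in that sets the nice level.
# $1 = drop-in directory, $2 = nice level
write_nice_dropin() {
    mkdir -p "$1"
    printf '[Service]\nNice=%s\n' "$2" > "$1/nice.conf"
}

# Scratch directory; on a real host the drop-in would live under
# /etc/systemd/system/telegraf.service.d/ (assumption: the unit is telegraf.service).
dropin_dir="$(mktemp -d)/telegraf.service.d"
write_nice_dropin "$dropin_dir" 10
cat "$dropin_dir/nice.conf"

# After `systemctl daemon-reload && systemctl restart telegraf`, check with:
#   systemctl show -p Nice telegraf    # value configured for the unit
#   ps -o ni=,comm= -C telegraf        # nice level of the running process
```

A higher nice level does not reduce the work telegraf does; it only lets the scheduler favour other processes when the CPU is contended.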
Updated by livdywan over 3 years ago
- Status changed from In Progress to Feedback
okurz wrote:
merged. On osd I ran systemctl daemon-reload && systemctl restart telegraf. I can confirm that telegraf runs with nice-level 10 now. What's next?
It looks like telegraf is on average using a lot less CPU than before, so I'm inclined to consider this a success.
Updated by livdywan over 3 years ago
- Related to action #96807: Web UI is slow and Apache Response Time alert got triggered added
Updated by livdywan over 3 years ago
- Status changed from Feedback to Resolved
cdywan wrote:
It looks like telegraf is on average using a lot less CPU than before, so I'm inclined to consider this a success.
Hence resolving.