action #87883

closed

osd infrastructure: services like "telegraf" are not enabled to start immediately on boot

Added by okurz almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2021-01-18
Due date: 2021-01-26
% Done: 0%
Estimated time:

Description

See https://chat.suse.de/group/qa-tools?msg=2WKi5LKkFxC5qwcsB

Strange: telegraf was now also disabled on openqaworker10, and hence not started on boot. I will check another host.
I think this can explain why hosts took longer than 1h to come back after the weekly reboot: in salt we never ensure that the service is enabled, we only ensure that it is "running". I am not sure when salt is even triggered, and I wonder why the hosts show up at all after 1h.
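
A minimal sketch of how the "running but not enabled" condition described above could be detected on a host. It assumes Python 3.7+ on a systemd machine and uses only `systemctl is-enabled` and `systemctl is-active`. The function names `query_unit` and `running_but_not_enabled` are hypothetical and not part of the salt states. As a simplification, only the exact state "enabled" is treated as "starts on boot", so states such as "static" would also be flagged.

```python
import subprocess


def query_unit(unit, verb, run=subprocess.run):
    """Return the one-word state printed by `systemctl <verb> <unit>`.

    `run` is injectable so the logic can be exercised without systemd.
    """
    result = run(
        ["systemctl", verb, unit],
        capture_output=True,
        text=True,
        check=False,  # is-enabled/is-active exit non-zero for "disabled"/"inactive"
    )
    return result.stdout.strip()


def running_but_not_enabled(unit, run=subprocess.run):
    """True if the unit is active now but would not be started on the next boot.

    Simplification (assumption): only the state "enabled" counts as enabled.
    """
    enabled = query_unit(unit, "is-enabled", run)
    active = query_unit(unit, "is-active", run)
    return active == "active" and enabled != "enabled"
```

On a worker this could be called as `running_but_not_enabled("telegraf")`; a result of `True` would match the situation seen on openqaworker10 before the reboot.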
Actions #1

Updated by okurz almost 4 years ago

  • Status changed from In Progress to Feedback
Actions #2

Updated by okurz almost 4 years ago

  • Due date set to 2021-01-26

Merged. This could explain why hosts only report online again after about 1h in grafana. I should wait until the next automatic (or manual) reboots of machines and check the "host up" monitoring panels on grafana. Once we have confirmed that hosts are reported as "up" sooner again, we may be able to reduce the alerting time period to a sensibly lower value.

Actions #3

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved

https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?viewPanel=65105&orgId=1&from=1611455044874&to=1611456238646 is an example showing that the "host up" panel now receives no data for only 7 minutes, which is much closer to what we would have expected. I have crosschecked the "host up" history of the past 7 days for all workers: except for the known problematic machines, all rebooted automatically within around 7 minutes, so reducing the alert period would be possible. However, I think it's ok to keep a bit of grace time in case someone wants to conduct manual actions on a machine that involve a reboot.
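
The crosscheck above was done by eye in grafana. As a rough sketch of the same comparison, assuming the "host up" samples were exported as a list of epoch timestamps in milliseconds (the format used in the grafana URL), the longest period without data can be compared against a candidate alert period. The names `longest_gap` and `exceeds_alert_period` and the sample values in the usage note are hypothetical.

```python
from datetime import timedelta


def longest_gap(timestamps_ms):
    """Return the longest interval between consecutive samples as a timedelta.

    Fewer than two samples yields a gap of zero.
    """
    ordered = sorted(timestamps_ms)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return timedelta(milliseconds=max(gaps, default=0))


def exceeds_alert_period(timestamps_ms, alert_period):
    """True if the longest gap without data is longer than the alert period."""
    return longest_gap(timestamps_ms) > alert_period
```

For made-up samples at 0, 1, 8 and 9 minutes, `longest_gap([0, 60_000, 480_000, 540_000])` gives a gap of 7 minutes, which stays below an alert period of `timedelta(minutes=15)` and so would leave some grace time for manual reboots.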
