osd infrastructure: services like "telegraf" are not enabled to start immediately on boot
Strange, telegraf was also disabled on openqaworker10. I checked another host as well: disabled there too and hence not started on boot. I think this can explain why hosts took longer than 1h to come back after the weekly reboot: in salt we never ensure that the service is enabled, we only ensure that it is "running". I am not sure when salt is even triggered, and I wonder why the hosts show up at all after 1h.
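To make the difference between "running" and "enabled" visible on a single host, one can ask systemd for both states. This is only a minimal sketch and not what salt does: the helper names `unit_state`, `survives_reboot` and `describe` are hypothetical, and it assumes a host where `systemctl` is available and the unit is called `telegraf`.

```python
import subprocess


def unit_state(unit):
    """Return the (is-enabled, is-active) output of systemctl for a unit."""
    def query(verb):
        # systemctl exits non-zero for states like "disabled" or "inactive",
        # so the exit code is deliberately not checked here.
        result = subprocess.run(
            ["systemctl", verb, unit],
            capture_output=True, text=True, check=False,
        )
        return result.stdout.strip()

    return query("is-enabled"), query("is-active")


def survives_reboot(enabled):
    """Only a persistently enabled unit is started again on boot."""
    # Simplifying assumption: any other state reported by systemctl
    # (e.g. "static", "enabled-runtime") is treated as not enabled.
    return enabled == "enabled"


def describe(unit, enabled, active):
    """Summarize whether the unit is affected by the problem described above."""
    if survives_reboot(enabled):
        return f"{unit}: enabled, will start on boot"
    if active == "active":
        return f"{unit}: running now, but will not start on next boot"
    return f"{unit}: neither running nor enabled"
```

On a worker, `print(describe("telegraf", *unit_state("telegraf")))` would then report the "running now, but will not start on next boot" case for a service that salt started without enabling it.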
- Status changed from In Progress to Feedback
- Due date set to 2021-01-26
Merged. This could explain why hosts only report as online again in grafana after about 1h. I should wait for the next automatic (or manual) reboot of the machines and check the "host up" monitoring panels in grafana. Once we have confirmed that hosts are reported as "up" sooner again, we may be able to reduce the alerting time period to a sensibly lower value.
- Status changed from Feedback to Resolved
https://stats.openqa-monitor.qa.suse.de/d/WDopenqaworker8/worker-dashboard-openqaworker8?viewPanel=65105&orgId=1&from=1611455044874&to=1611456238646 is an example showing that the "host up" panel now receives no data for only 7 minutes, which is much closer to what we would have expected. I have crosschecked the "host up" history of the past 7 days for all workers: except for the known problematic machines, all rebooted automatically within around 7 minutes, so reducing the alert period would be possible. However, I think it's ok to keep some grace time in case someone wants to conduct manual actions on a machine that involve a reboot.
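The crosscheck amounts to finding the longest "no data" gap per host and comparing it with the alert period. A rough sketch of that comparison follows; the sample timestamps are made up for illustration, `longest_gap` is a hypothetical helper, and the 15-minute grace period is an assumed value, not the configured one.

```python
from datetime import datetime, timedelta


def longest_gap(samples):
    """Return the longest interval between two consecutive data points."""
    ordered = sorted(samples)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # With fewer than two samples there is no gap to measure.
    return max(gaps, default=timedelta(0))


# Made-up "host up" sample times for one host; the jump from 02:30 to 02:37
# stands for a reboot during which no data was received.
samples = [
    datetime(2021, 1, 24, 2, 29),
    datetime(2021, 1, 24, 2, 30),
    datetime(2021, 1, 24, 2, 37),
    datetime(2021, 1, 24, 2, 38),
]

gap = longest_gap(samples)
grace = timedelta(minutes=15)  # assumed alert period, not the real setting
print(f"longest gap: {gap}, within grace period: {gap <= grace}")
```

Any alert period has to stay above the longest gap seen on healthy hosts, with some margin left for manual reboots.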