action #157615 (closed)
[alert] osd-deployment failed in post-deploy, telegraf errors size:M
Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2024-03-20
Due date: 2024-04-09
% Done: 0%
Estimated time:
Tags:
Description
See https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217
schort-server.qe.nue2.suse.org:
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
telegraf errors
monitor.qe.nue2.suse.org:
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
++ grep ' E! ' salt_post_deploy_checks.log
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':
2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':
2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
Suggestions
- Understand why and where systemd_list_service_by_state_for_telegraf.sh times out. The timeout could come from the general telegraf timeout in the pipeline, from the execution of the script itself (configured in telegraf.conf), or from another place. Adjust the timeout to match the expected runtime or fix the script to complete faster
  -> schort-server only has 1 VM core, consider configuring the hypervisor to use at least 2 cores
- "Error killing process: os: process already finished" might just be a consequence of the above
- "Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout" could possibly be covered with some retrying. Investigate what the real error message means, e.g. ask https://www.ecosia.org/chat (or, if that does not work, invest in coal-powered https://www.cat-gpt.com/chat ) or something similar
- If we cannot solve these problems, consider excluding them from CI execution to avoid false positives. Consider the impact of doing this first, however!
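As a rough sketch of the "adjust the timeout" option: telegraf's exec input accepts a per-plugin `timeout` setting. The fragment below is an assumption, not the actual content of telegraf.conf in salt-states-openqa. The `30s` value and the grouping of both commands into one block are hypothetical, and the value should be derived from the measured runtime of the script on schort-server.

```toml
# Hypothetical fragment, not copied from salt-states-openqa.
[[inputs.exec]]
  # The two commands reported as timed out in the post-deploy log.
  commands = [
    "/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude \"\"",
    "/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude \"\""
  ]
  # Assumed value; choose it from the observed runtime on the slowest host.
  timeout = "30s"
```

If the timeout turns out to come from the pipeline rather than from this plugin setting, changing this value alone would not help, so it is worth confirming where the limit is enforced first.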
Updated by openqa_review 8 months ago
- Due date set to 2024-04-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler 8 months ago · Edited
- Status changed from In Progress to Feedback
No retry supported, but maybe increasing the timeout will help:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/x509_cert/README.md
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1135
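For illustration, the timeout increase mentioned above could look roughly like this in the x509_cert input configuration. This is a sketch under assumptions: the real change is in the linked merge request, and the `30s` value here is hypothetical. `sources` and `timeout` are plugin options described in the README linked above.

```toml
# Hypothetical fragment; see the merge request above for the real change.
[[inputs.x509_cert]]
  # Source taken from the error message in the post-deploy log.
  sources = ["https://monitor.qa.suse.de:443"]
  # Assumed value; the log shows the lookup of monitor.qa.suse.de timing out.
  timeout = "30s"
```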