action #157615

closed

[alert] osd-deployment failed in post-deploy, telegraf errors size:M

Added by jbaier_cz 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2024-03-20
Due date: 2024-04-09
% Done: 0%
Estimated time:
Description

See https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217

schort-server.qe.nue2.suse.org:
    2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
    2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
    2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""': 
    2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""': 
    2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
    telegraf errors
monitor.qe.nue2.suse.org:
    2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
    2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
    telegraf errors
++ grep ' E! ' salt_post_deploy_checks.log
    2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
    2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished
    2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""': 
    2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""': 
    2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors
    2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout
    2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors
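
To see at a glance which telegraf component produced the errors, the `grep ' E! '` output above could be grouped by the name in square brackets. The following is only a minimal sketch: the function name `count_errors_by_component` and the regular expression are assumptions derived from the line format shown in this log, not part of salt-states-openqa.

```python
import re
from collections import Counter

# Matches telegraf error lines in the format seen in the log above, e.g.:
#   2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: ...
# Groups: timestamp, component name, message.
ERROR_LINE = re.compile(r"^\s*(\S+Z) E! \[([^\]]+)\] (.*)$")


def count_errors_by_component(log_text):
    """Return a Counter mapping telegraf component name to its number of E! lines.

    Hypothetical helper for illustration; lines that do not look like
    telegraf error lines (host names, "telegraf errors", shell traces)
    are ignored.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

Calling it with the content of salt_post_deploy_checks.log would separate the `inputs.exec` timeouts from the `inputs.x509_cert` lookup failure, which have different likely causes and are addressed separately in the suggestions.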

Suggestions

  1. Understand why and where systemd_list_service_by_state_for_telegraf.sh times out. The limit could come from the general telegraf timeout in the pipeline, from the execution of the script itself (as configured in telegraf.conf), or from somewhere else. Adjust the timeout to match the expected runtime, or fix the script to complete faster. Note that schort-server only has 1 VM core; consider configuring the hypervisor to give it at least 2 cores.
  2. "Error killing process: os: process already finished" might just be a consequence of the above.
  3. "Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout" could possibly be covered by some retrying. Investigate what the error message really means, e.g. ask https://www.ecosia.org/chat (or, if that does not work, invest in the coal-powered https://www.cat-gpt.com/chat) or something similar.
  4. If we cannot solve these problems, consider excluding them from CI execution to avoid false positives. Consider the impact of doing so first, however!
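
Suggestions 1 and 3 can be explored with a small script. The sketch below is not taken from the repository: `timed_run` and `retry` are hypothetical helper names, and the choice of `OSError` as the retryable exception is an assumption (in Python, DNS lookup failures, socket timeouts and TLS errors are all subclasses of it). The first helper measures how long a command takes under a given limit; the second retries an operation a few times with a growing pause.

```python
import subprocess
import time


def timed_run(cmd, timeout_s):
    """Run cmd and return (elapsed_seconds, timed_out).

    Hypothetical helper: output is discarded, only the runtime and
    whether the limit was hit are reported.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        timed_out = True
    return time.monotonic() - start, timed_out


def retry(operation, attempts=3, delay_s=1.0, retry_on=(OSError,)):
    """Call operation() up to `attempts` times and return its result.

    Sleeps delay_s * attempt_number between tries and re-raises the
    last error if every attempt fails. Assumption: transient network
    errors surface as OSError subclasses.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_s * attempt)
```

On schort-server, `timed_run` could be pointed at `/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""` with a generous limit to learn the real runtime before changing any timeout. Likewise, wrapping a certificate fetch such as `ssl.get_server_certificate(("monitor.qa.suse.de", 443))` in `retry` would show whether the lookup failure is transient.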