action #157615

Updated by okurz about 2 months ago

See https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/2411217 

 ``` 
 schort-server.qe.nue2.suse.org: 
     2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished 
     2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished 
     2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':  
     2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':  
     2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors 
     telegraf errors 
 monitor.qe.nue2.suse.org: 
     2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout 
     2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors 
     telegraf errors 
 ++ grep ' E! ' salt_post_deploy_checks.log 
     2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished 
     2024-03-20T16:23:32Z E! [agent] Error killing process: os: process already finished 
     2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state masked --exclude ""':  
     2024-03-20T16:23:32Z E! [inputs.exec] Error in plugin: exec: command timed out for command '/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh --state failed --exclude ""':  
     2024-03-20T16:23:32Z E! [telegraf] Error running agent: input plugins recorded 2 errors 
     2024-03-20T16:23:31Z E! [inputs.x509_cert] Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout 
     2024-03-20T16:23:35Z E! [telegraf] Error running agent: input plugins recorded 1 errors 
 ``` 

 ## Suggestions 
 1. Understand why and where `systemd_list_service_by_state_for_telegraf.sh` times out. It could be the general telegraf timeout in the pipeline, the timeout for executing the script itself (from telegraf.conf), or another place. Adjust the timeout to match the expected runtime or fix the script to complete faster. Note that schort-server only has 1 VM core, so consider configuring the hypervisor to use at least 2 cores
 2. "Error killing process: os: process already finished" might just be a consequence of the above 
 3. "Error in plugin: cannot get SSL cert 'https://monitor.qa.suse.de:443': dial tcp: lookup monitor.qa.suse.de: i/o timeout" could possibly be covered by some retrying. Investigate what the real error message means, e.g. ask https://www.ecosia.org/chat (or if that does not work invest in coal-powered https://www.cat-gpt.com/chat ) or something similar
 4. If we cannot solve these problems, consider excluding them from CI execution to avoid false positives. Consider the impact of doing this first, however!
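
As a starting point for suggestions 1 and 3, here is a minimal Python sketch, not the actual salt-states-openqa check. It measures how long the script takes under a hard time limit and wraps a DNS lookup in a simple retry. The helper names `time_command` and `retry`, the 30-second limit, and the attempt count and delay are all hypothetical placeholders; the real telegraf and pipeline timeouts would have to be looked up first.

```python
import socket
import subprocess
import time


def time_command(cmd, timeout):
    """Run cmd with a hard time limit and return (seconds_elapsed, timed_out)."""
    start = time.monotonic()
    try:
        subprocess.run(cmd, timeout=timeout, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        timed_out = True
    return time.monotonic() - start, timed_out


def retry(func, attempts=3, delay=1.0, exceptions=(OSError,)):
    """Call func until it succeeds or attempts are used up, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


if __name__ == "__main__":
    # Script path and arguments taken from the log above; the 30 s limit is
    # an assumed placeholder, not the configured telegraf timeout.
    script = "/etc/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh"
    elapsed, timed_out = time_command(
        [script, "--state", "failed", "--exclude", ""], timeout=30)
    print(f"elapsed={elapsed:.1f}s timed_out={timed_out}")

    # socket.gaierror is a subclass of OSError, so failed lookups are retried.
    addrs = retry(lambda: socket.getaddrinfo("monitor.qa.suse.de", 443))
    print(f"resolved {len(addrs)} address(es)")
```

Running this a few times on schort-server would show whether the script itself is slow on a single core or whether the timeout is hit elsewhere. If the lookup only succeeds after a retry, that would point at flaky DNS rather than a certificate problem.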
