action #167051: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

## Observation 

 https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 

 ``` 
 monitor.qe.nue2.suse.org: 
     2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):... 
     2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors 
     telegraf errors 
 ``` 

 `systemctl status telegraf` on monitor reports:

 ``` 
 ● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB 
      Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled) 
      Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago 
        Docs: https://github.com/influxdata/telegraf 
    Main PID: 1481 (telegraf) 
       Tasks: 21 (limit: 4915) 
         CPU: 8h 20min 48.515s 
      CGroup: /system.slice/telegraf.service 
              ├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d 
              └─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session 

 Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec> 
 Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval 
 Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808> 
 Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval 
 Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address 
 Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808> 
 Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address 
 Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec> 
 Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti> 
 Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c> 
 ``` 

 ## Acceptance criteria 
 * **AC1:** Significant reduction in errors in our CI pipelines 
 * **AC2:** Errors in business-related tooling are still visible somewhere 

 ## Suggestions 
 * Look into the InfluxDB connection error first. If it no longer reproduces and the logs show no further occurrences, no further action is needed for it 
 * Consider separating reporting about low-level monitoring from business-related tooling. At the very least, adjust the grep 
 * Report separate tickets for problems in business scripts 
 * Consider splitting non-critical parts out into a separate config file in /etc/telegraf/telegraf.d/ and checking whether only the relevant inputs succeed, or using a separate config for business scripts with its own telegraf service invocation and a separate log or journal target
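
As an illustration of the "adjust the grep" suggestion, here is a minimal sketch in Python of how telegraf error lines could be split into business-script errors and everything else. It is not the check currently used in the pipeline. The function names, the `BUSINESS_SCRIPTS` list and the decision to treat exactly these two scripts as business-related are assumptions; only the script names and the log line format are taken from the excerpts above.

```python
import re

# Assumption: these scripts count as "business related". The names come from
# the log excerpt in this ticket; the grouping itself is hypothetical.
BUSINESS_SCRIPTS = (
    "maintenance_queue_monitor.py",
    "collect_sleperf_test.py",
)

# Matches telegraf error lines such as:
#   2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: ...
ERROR_LINE = re.compile(r"\bE! \[(?P<plugin>[^\]]+)\] (?P<message>.*)")


def classify(line):
    """Return 'business', 'infrastructure', or None for non-error lines."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None  # warnings ("W!") and other output are ignored
    is_exec = match["plugin"] == "inputs.exec"
    if is_exec and any(s in match["message"] for s in BUSINESS_SCRIPTS):
        return "business"
    return "infrastructure"


def split_errors(lines):
    """Group error lines by category, dropping everything else."""
    groups = {"business": [], "infrastructure": []}
    for line in lines:
        category = classify(line)
        if category is not None:
            groups[category].append(line)
    return groups
```

A CI check built on this could fail only when `groups["infrastructure"]` is non-empty and report `groups["business"]` elsewhere, which would keep those errors visible as AC2 asks. Note that the summary line `[telegraf] Error running agent: input plugins recorded 1 errors` is classified as infrastructure in this sketch, so it would likely need separate handling.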
