action #167051
Updated by nicksinger about 2 months ago
## Observation

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145

```
monitor.qe.nue2.suse.org:
2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
```

`systemctl status telegraf` on monitor says

```
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 1481 (telegraf)
      Tasks: 21 (limit: 4915)
        CPU: 8h 20min 48.515s
     CGroup: /system.slice/telegraf.service
             ├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
             └─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session

Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>
```

## Acceptance criteria

* **AC1:** Significant reduction in errors in our CI pipelines
* **AC2:** Errors in business related tooling are still visible somewhere

## Suggestions

* ~~Look into the influxdb connection error primarily. If it does not reproduce anymore and there are no further mentions in the logs, then no further action is needed~~ **DONE** journal output unrelated to pipeline result, most likely a temporary outage
* Consider separating reporting about low level monitoring from business related tooling. At the very least adjust the grep
* ~~Report separate tickets about problems in business scripts~~ **DONE** not applicable here, the mentioned logs hint at a general network issue, currently all scripts return 0
* ~~Consider splitting out non-critical parts into a different conf file in /etc/telegraf.d/ and see if only the relevant ones are successful, or a separate config for business scripts with separate telegraf service invocation, separate log or journal target~~ **DONE** external scripts are already split out
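
For the one suggestion still open (separating low level monitoring errors from business related tooling, or at least adjusting the grep), a rough sketch of such a filter might look like the following. It assumes that every `E! [inputs.exec]` line naming a script belongs to business tooling and that any other `E!` line is low level monitoring; that split, and the function name `classify_telegraf_errors`, are illustrative assumptions and not what the pipeline currently does.

```python
import re

# Assumption: errors from scripts started by the exec input plugin count as
# "business tooling"; every other E! line counts as low level monitoring.
EXEC_ERROR = re.compile(
    r'E! \[inputs\.exec\] Error in plugin: exec: .* for command "(?P<script>[^"]+)"'
)
ANY_ERROR = re.compile(r"\bE! \[(?P<plugin>[^\]]+)\]")


def classify_telegraf_errors(lines):
    """Split telegraf log lines into (business, low_level) error lists.

    business  -> list of (script_path, line) tuples
    low_level -> list of lines
    Warnings (W!) and other non-error lines are ignored.
    """
    business, low_level = [], []
    for line in lines:
        exec_match = EXEC_ERROR.search(line)
        if exec_match:
            business.append((exec_match.group("script"), line))
        elif ANY_ERROR.search(line):
            low_level.append(line)
    return business, low_level
```

With a split like this, the CI job could fail only on the low level list, which would serve AC1, while the business list is reported through a separate channel so that it stays visible, as AC2 requires. Which channel that would be is left open here.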