action #167051
Updated by okurz about 2 months ago
## Observation
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145
```
monitor.qe.nue2.suse.org:
2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
```
`systemctl status telegraf` on monitor.qe.nue2.suse.org shows:
```
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
Docs: https://github.com/influxdata/telegraf
Main PID: 1481 (telegraf)
Tasks: 21 (limit: 4915)
CPU: 8h 20min 48.515s
CGroup: /system.slice/telegraf.service
├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
└─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session
Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>
```
## Acceptance criteria
* **AC1:** Significant reduction in telegraf errors reported in our CI pipelines
* **AC2:** Errors in business-related tooling are still visible somewhere
## Suggestions
* Look into the InfluxDB connection error first. If it no longer reproduces and the logs show no further mentions of it, no further action is needed
* Consider separating the reporting of low-level monitoring errors from that of business-related tooling. At the very least, adjust the grep
* Report separate tickets about problems in business scripts
* Consider splitting out non-critical parts into a different conf file in /etc/telegraf/telegraf.d/ and check whether only the relevant ones are successful. Alternatively, use a separate config for business scripts with a separate telegraf service invocation and a separate log or journal target
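
As a starting point for "adjust the grep", here is a minimal sketch of how the telegraf log lines could be split into business script errors and low-level monitoring errors. It is not the check currently used in the pipeline. It assumes that all business tooling runs through `inputs.exec` from `/etc/telegraf/scripts/`, as in the log excerpt above, and the names `classify` and `split_errors` are hypothetical.

```python
import re

# Assumption: business tooling is executed by telegraf's exec input from
# /etc/telegraf/scripts/, matching the log lines quoted in this ticket.
SCRIPT_ERROR = re.compile(
    r'E! \[inputs\.exec\] Error in plugin: exec: .* for command '
    r'"(?P<script>/etc/telegraf/scripts/[^"]+)"'
)
# Any telegraf error line, e.g. "E! [outputs.influxdb] ..." or "E! [agent] ...".
ANY_ERROR = re.compile(r"\bE! \[")


def classify(line):
    """Return ("business", script_path), ("low-level", None) or (None, None)."""
    match = SCRIPT_ERROR.search(line)
    if match:
        return "business", match.group("script")
    if ANY_ERROR.search(line):
        return "low-level", None
    # Warnings ("W!") and non-error lines are ignored.
    return None, None


def split_errors(lines):
    """Group business errors by script and collect all other errors in a list."""
    business = {}
    low_level = []
    for line in lines:
        kind, script = classify(line)
        if kind == "business":
            business.setdefault(script, []).append(line)
        elif kind == "low-level":
            low_level.append(line)
    return business, low_level
```

With such a split, the pipeline could fail only on the low-level list, while the per-script groups could feed separate tickets or a separate report. Note that the generic `[telegraf] Error running agent: input plugins recorded 1 errors` line ends up in the low-level list even when a business script caused it, so a real check would likely need to handle that line specially.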