action #167051
coordination #161414 (closed): [epic] Improved salt based infrastructure management
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S
Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-09-19
Due date:
% Done: 0%
Estimated time:
Description
Observation
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145
monitor.qe.nue2.suse.org:
2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
systemctl status telegraf on monitor says:
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
Docs: https://github.com/influxdata/telegraf
Main PID: 1481 (telegraf)
Tasks: 21 (limit: 4915)
CPU: 8h 20min 48.515s
CGroup: /system.slice/telegraf.service
├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
└─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session
Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>
Acceptance criteria
- AC1: Significant reduction in errors in our CI pipelines
- AC2: Errors in business related tooling are still visible somewhere
Suggestions
- Look into the influxdb connection error primarily. If it does not reproduce anymore and there are no further mentions in the logs, then no further action is needed.
  DONE: journal output unrelated to pipeline result, most likely a temporary outage
- Consider separating reporting about low level monitoring from business related tooling. At the very least adjust the grep.
- Report separate tickets about problems in business scripts.
  DONE: not applicable here, the mentioned logs hint at a general network issue, currently all scripts return 0
- Consider splitting out non-critical parts into a different conf file in /etc/telegraf.d/ and see if only the relevant ones are successful, or use a separate config for business scripts with a separate telegraf service invocation and a separate log or journal target.
  DONE: external scripts are already split out
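The suggestion to separate low level monitoring errors from business related tooling errors could be sketched roughly as follows. This is only an illustration, not the check the pipeline actually runs: the function names, the regular expressions and the assumption that everything under /etc/telegraf/scripts/ counts as business tooling are hypothetical.

```python
import re

# Assumption: exec scripts under this directory are "business" tooling.
BUSINESS_SCRIPT_DIR = "/etc/telegraf/scripts/"

# Matches telegraf error lines such as: ... E! [inputs.exec] Error in plugin: ...
ERROR_RE = re.compile(r"\bE! \[(?P<plugin>[^\]]+)\]")
# Extracts the command from: ... for command "/path/to/script.py": ...
COMMAND_RE = re.compile(r'for command "(?P<command>[^"]+)"')


def classify(line):
    """Return 'business', 'low-level', or None for a line that is not an error."""
    error = ERROR_RE.search(line)
    if error is None:
        return None
    command = COMMAND_RE.search(line)
    if (
        error.group("plugin") == "inputs.exec"
        and command is not None
        and command.group("command").startswith(BUSINESS_SCRIPT_DIR)
    ):
        return "business"
    return "low-level"


def split_errors(lines):
    """Group error lines so that each class can be reported separately."""
    buckets = {"business": [], "low-level": []}
    for line in lines:
        kind = classify(line)
        if kind is not None:
            buckets[kind].append(line)
    return buckets
```

With such a split, a CI job could fail only on the low-level bucket while the business bucket is reported elsewhere, which would match AC1 and AC2. Note that in this sketch the summary line "E! [telegraf] Error running agent: input plugins recorded 1 errors" lands in the low-level bucket even when a business script caused it, so a real check would need to handle that line separately.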