action #167051

closed

coordination #161414: [epic] Improved salt based infrastructure management

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S

Added by okurz 2 months ago. Updated 5 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-09-19
Due date:
% Done: 0%
Estimated time:
Description

Observation

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145

monitor.qe.nue2.suse.org:
    2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
    2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
    telegraf errors

systemctl status telegraf on monitor reports:

● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 1481 (telegraf)
      Tasks: 21 (limit: 4915)
        CPU: 8h 20min 48.515s
     CGroup: /system.slice/telegraf.service
             ├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
             └─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session

Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>
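
To see at a glance which telegraf components produced the messages above, the lines can be tallied per level and source. The following is only a minimal sketch: the function name `tally` and the regular expression are assumptions made for illustration, based solely on the line format visible in this ticket, and are not part of any existing tooling.

```python
import re
from collections import Counter

# Matches the telegraf part of a line, e.g. "2024-09-19T11:12:38Z W! [agent] ...".
# A journal prefix such as "Sep 19 13:12:38 monitor telegraf[1481]: " is skipped
# because search() is used instead of match().
TELEGRAF_LINE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z (?P<level>[A-Z])! \[(?P<source>[^\]]+)\]"
)


def tally(lines):
    """Count log lines per (level, source); lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = TELEGRAF_LINE.search(line)
        if match:
            counts[(match.group("level"), match.group("source"))] += 1
    return counts
```

Fed with the journal excerpt above, this would separate the `inputs.exec` script failures from the `outputs.influxdb` and `agent` write errors, which is the distinction the suggestions in this ticket rely on.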

Acceptance criteria

  • AC1: Significant reduction in errors in our CI pipelines
  • AC2: Errors in business-related tooling are still visible somewhere

Suggestions

  • Look into the influxdb connection error first. If it no longer reproduces and there are no further mentions in the logs, no further action is needed. DONE: the journal output is unrelated to the pipeline result, most likely a temporary outage.
  • Consider separating reporting about low-level monitoring from business-related tooling. At the very least, adjust the grep.
  • Report separate tickets about problems in business scripts. DONE: not applicable here; the mentioned logs hint at a general network issue, and currently all scripts return 0.
  • Consider splitting non-critical parts out into a separate conf file in /etc/telegraf.d/ and checking whether only the relevant ones succeed, or use a separate config for business scripts with a separate telegraf service invocation and a separate log or journal target. DONE: external scripts are already split out.
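
The idea of separating low-level monitoring errors from business script errors could look roughly like the sketch below. It is not the actual pipeline check: the names `split_errors` and `check` are hypothetical, and the assumption that business scripts are exactly the `inputs.exec` commands under /etc/telegraf/scripts/ is taken only from the log excerpts in this ticket.

```python
import re

# Assumption: business scripts are the ones telegraf runs via inputs.exec
# from /etc/telegraf/scripts/, as seen in the log excerpts of this ticket.
SCRIPT_ERROR = re.compile(r'E! \[inputs\.exec\].*"(/etc/telegraf/scripts/[^"]+)"')
ANY_ERROR = re.compile(r"\bE! \[")


def split_errors(lines):
    """Return (infra_errors, script_errors) for the given telegraf log lines."""
    infra, scripts = [], []
    for line in lines:
        if not ANY_ERROR.search(line):
            continue  # warnings and informational lines are ignored
        match = SCRIPT_ERROR.search(line)
        if match:
            scripts.append((match.group(1), line))
        else:
            infra.append(line)
    return infra, scripts


def check(lines):
    """Hypothetical exit code: fail only on errors outside the business scripts."""
    infra, scripts = split_errors(lines)
    for path, _ in scripts:
        # Still reported, so business script errors stay visible (AC2).
        print(f"business script error (not fatal): {path}")
    for line in infra:
        print(f"telegraf error: {line}")
    return 1 if infra else 0
```

With this split, a failing maintenance_queue_monitor.py would be printed but would not fail the check, while an `outputs.influxdb` write error would. Note that a summary line such as "[telegraf] Error running agent: input plugins recorded 1 errors" is treated as a non-script error by this sketch, so a real check would have to decide how to handle it.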

Related issues 3 (1 open, 2 closed)

Related to openQA Infrastructure - action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S (Resolved, gpathak, 2024-10-02)

Has duplicate openQA Infrastructure - action #168475: salt-states-openqa telegraf pipeline failing with error in libcrypto (Rejected, livdywan)

Copied to openQA Infrastructure - action #168145: implement telegraf health check and adjust according pipelines (New)
