action #167051: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #167051

closed

coordination #161414: [epic] Improved salt based infrastructure management

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S

Added by okurz 7 months ago. Updated 6 months ago.

Status:

Resolved

Priority:

High

Assignee:

nicksinger

Category:

Regressions/Crashes

Target version:

openQA Project (public) - Ready

Start date:

2024-09-19

Due date:

% Done:

Estimated time:

Tags:

gitlab, influxdb, grafana, infra, telegraf

Description

Observation

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145

monitor.qe.nue2.suse.org:
    2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
    2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
    telegraf errors

systemctl status telegraf on monitor says

● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 1481 (telegraf)
      Tasks: 21 (limit: 4915)
        CPU: 8h 20min 48.515s
     CGroup: /system.slice/telegraf.service
             ├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
             └─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session

Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>
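
To see at a glance which components produce the messages above, the log lines can be tallied by level and component. This is only a minimal sketch: the function name `tally` and the regular expression are assumptions for illustration, based on the line format visible in the excerpts (`<timestamp> <level>! [<component>] <message>`), not part of the actual pipeline.

```python
import re
from collections import Counter

# Matches telegraf log lines such as
#   2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: ...
# Using search() means a leading journald prefix
# ("Sep 19 13:00:20 monitor telegraf[1481]: ") or indentation is skipped.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) "
    r"(?P<level>[A-Z])! "
    r"\[(?P<component>[^\]]+)\] "
    r"(?P<message>.*)"
)


def tally(lines):
    """Count log lines per (level, component); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["level"], match["component"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: journalctl -u telegraf | python3 tally.py
    for (level, component), count in sorted(tally(sys.stdin).items()):
        print(f"{level} {component}: {count}")
```

Fed with the journal excerpt above, this would group the `inputs.exec` script failures separately from the `outputs.influxdb` and `agent` write problems.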

Acceptance criteria

AC1: Significant reduction in errors in our CI pipelines
AC2: Errors in business-related tooling are still visible somewhere

Suggestions

~~Look into the InfluxDB connection error first. If it no longer reproduces and the logs show no further mentions, no further action is needed.~~ DONE: the journal output is unrelated to the pipeline result; most likely a temporary outage.
Consider separating the reporting of low-level monitoring from that of business-related tooling. At the very least, adjust the grep.
~~Report separate tickets about problems in business scripts.~~ DONE: not applicable here; the logs mentioned hint at a general network issue, and currently all scripts return 0.
Consider splitting out non-critical parts into a different conf file in /etc/telegraf/telegraf.d/ and checking whether only the relevant ones succeed, or using a separate config for business scripts with a separate telegraf service invocation and a separate log or journal target. DONE: external scripts are already split out.
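
One way the "adjust the grep" suggestion could look, as a rough sketch only: the ticket does not show the pipeline's actual check, so the names `split_errors` and `exit_code`, the assumption that business-related collectors are exactly the exec scripts under `/etc/telegraf/scripts/`, and the handling of telegraf's summary line are all hypothetical. The idea is to fail the job only on monitoring errors (AC1) while still printing errors from business scripts (AC2).

```python
import re

# Assumption: business-related collectors are the exec scripts in this
# directory, as seen in the log excerpts of this ticket.
BUSINESS_SCRIPT_DIR = "/etc/telegraf/scripts/"

# Assumption: this line only restates that some input plugin failed,
# so it is not counted as an error of its own.
SUMMARY_PREFIX = "Error running agent: input plugins recorded"

ERROR_RE = re.compile(r"\bE! \[(?P<component>[^\]]+)\] (?P<message>.*)")


def split_errors(lines):
    """Return (business, infrastructure) lists of telegraf error lines."""
    business, infrastructure = [], []
    for line in lines:
        match = ERROR_RE.search(line)
        if not match:
            continue
        component, message = match["component"], match["message"]
        if component == "telegraf" and message.startswith(SUMMARY_PREFIX):
            continue
        if component == "inputs.exec" and BUSINESS_SCRIPT_DIR in message:
            business.append(line.strip())
        else:
            infrastructure.append(line.strip())
    return business, infrastructure


def exit_code(lines):
    """Print every error, but return non-zero only for infrastructure errors."""
    business, infrastructure = split_errors(lines)
    for line in business:
        print(f"WARNING (business tooling): {line}")
    for line in infrastructure:
        print(f"ERROR (monitoring): {line}")
    return 1 if infrastructure else 0
```

With the two error lines from the failed job above, `exit_code` would print one warning for `maintenance_queue_monitor.py` and return 0; an `outputs.influxdb` or `agent` error would still return 1.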

Related issues 3 (1 open — 2 closed)
