action #167051: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #167051

closed

coordination #161414: [epic] Improved salt based infrastructure management

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S

Added by okurz 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: High
Assignee: nicksinger
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date: 2024-09-19
Due date:
% Done:
Estimated time:
Tags: gitlab, influxdb, grafana, infra, telegraf

Description

Observation

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145

monitor.qe.nue2.suse.org:
    2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
    2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
    telegraf errors

systemctl status telegraf on monitor says

● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 1481 (telegraf)
      Tasks: 21 (limit: 4915)
        CPU: 8h 20min 48.515s
     CGroup: /system.slice/telegraf.service
             ├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
             └─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session

Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>

Acceptance criteria

AC1: Significant reduction in errors in our CI pipelines
AC2: Errors in business related tooling are still visible somewhere

Suggestions

~~Look into the influxdb connection error primarily. If it does not reproduce anymore and there are no further mentions in the logs, then no further action is needed~~ DONE: journal output is unrelated to the pipeline result, most likely a temporary outage
Consider separating reporting about low-level monitoring from business-related tooling. At the very least adjust the grep
~~Report separate tickets about problems in business scripts~~ DONE: not applicable here, the mentioned logs hint at a general network issue, currently all scripts return 0
Consider splitting out non-critical parts into a different conf file in /etc/telegraf.d/ and see if only the relevant ones are successful, or use a separate config for business scripts with a separate telegraf service invocation and a separate log or journal target. DONE: external scripts are already split out
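
As a rough illustration of what "adjust the grep" could mean, the following Python sketch sorts telegraf error lines into business-script errors, other errors, and the summary line that telegraf prints at the end of a test run. It is not the check used in the pipeline. Which scripts count as "business" tooling is an assumption based only on the script names in the logs above, and the names `BUSINESS_SCRIPTS`, `classify` and `split_errors` are hypothetical.

```python
import re

# Assumption: these two scripts are "business" tooling. The set is a guess
# based on the script names that appear in the logs of this ticket.
BUSINESS_SCRIPTS = {"maintenance_queue_monitor.py", "collect_sleperf_test.py"}

# Matches telegraf error lines such as
#   2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: ...
ERROR_LINE = re.compile(r"\bE! \[(?P<plugin>[^\]]+)\]")
# Extracts the script path from 'for command "/path/to/script.py"'
COMMAND = re.compile(r'for command "(?P<path>[^"]+)"')


def classify(line):
    """Return 'business', 'critical', 'summary', or None for non-error lines."""
    match = ERROR_LINE.search(line)
    if not match:
        return None
    # The closing line of a test run counts all plugin errors, so it cannot
    # be attributed to one domain and is kept separate.
    if match.group("plugin") == "telegraf" and "input plugins recorded" in line:
        return "summary"
    command = COMMAND.search(line)
    if command and command.group("path").rsplit("/", 1)[-1] in BUSINESS_SCRIPTS:
        return "business"
    return "critical"


def split_errors(lines):
    """Group error lines so that only 'critical' ones need to fail a pipeline."""
    groups = {"business": [], "critical": [], "summary": []}
    for line in lines:
        kind = classify(line)
        if kind:
            groups[kind].append(line)
    return groups
```

With such a split, a pipeline could fail only when the "critical" group is non-empty while still printing the "business" group, which would keep those errors visible as asked for in AC2.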

Related issues 3 (1 open — 2 closed)

Updated by okurz 7 months ago

Subject changed from https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S
Description updated (diff)
Status changed from New to Workable

Updated by nicksinger 7 months ago

Status changed from Workable to In Progress
Assignee set to nicksinger

Updated by openqa_review 7 months ago

Due date set to 2024-10-16

Setting due date based on mean cycle time of SUSE QE Tools

Updated by nicksinger 7 months ago

The problematic action triggering the failed pipeline is https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L82 - we run telegraf with "--test" which according to the documentation:

enable test mode: gather metrics, print them out, and exit. Note: Test mode only runs inputs, not processors, aggregators, or outputs

so the influx errors we see in the journal are related to runtime and not the pipeline. Regarding the execution errors, I think telegraf is handling this fine: it fails with a corresponding log message and continues operation. We introduced https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/805 to resolve https://progress.opensuse.org/issues/125765. In my opinion we should move this check for empty/missing slo data into grafana. Having telegraf properly running after deployment should be covered by https://progress.opensuse.org/issues/167728

Updated by nicksinger 7 months ago

Related to action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S added

Updated by nicksinger 7 months ago

Description updated (diff)

Updated by nicksinger 7 months ago

Discussed with @okurz in jitsi how to continue here and we noted down some points together (either precursor or follow-up tasks or part of the ticket):

Hackweek project: log aggregation, so that we can alert ourselves on errors in logs properly and consistently for all relevant services
Split telegraf between (between what? -> research) to avoid mixing runtime/critical errors. This can also allow us to treat all errors as critical in a critical domain and be lenient in business-process-related services
Consider gathering telegraf log errors (e.g. by also executing telegraf -test periodically in a pipeline schedule in https://gitlab.suse.de/openqa/salt-states-openqa/-/pipeline_schedules, or via logwarn) so that we can see telegraf errors that are not related to merge requests and are not confused by failing CI checks in open merge requests about unrelated problems

Updated by nicksinger 7 months ago

Despite the more general discussion, I think the easiest approach currently is to only run this step if actual telegraf configs were changed. I tried to add this with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 but fail to understand the reported error:

    jobs:telegraf config key may not be used with `rules`: only

this might be related to https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L17-19 (which apparently is deprecated?). But I also failed to quickly convert the logic behind these two lines into a rules-statement.

Updated by nicksinger 7 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 limits telegraf runs to changes to its config only. I was also reading https://www.influxdata.com/blog/telegraf-best-practices/ which mentions multiple instances regarding stability, but mainly to avoid losing all data (which we currently don't cover and never saw huge problems with). It also doesn't really go into detail on how to handle warnings/errors in the logs. But it mentions the https://github.com/influxdata/telegraf/tree/master/plugins/outputs/health#health-output-plugin which seems very powerful and would allow us to request a status in pipelines while also defining exceptions. I will try to implement a simple example.
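
For illustration only (this is not the implementation from this ticket), a pipeline step could query such a health endpoint roughly as in the Python sketch below. It assumes that the health output plugin is enabled in telegraf and reachable under `HEALTH_URL`; host and port are placeholders. It also assumes, from my reading of the plugin documentation, that the endpoint answers with HTTP 200 while all configured checks pass and with a non-200 code such as 503 otherwise. That should be verified against the plugin README before relying on it. The function names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: telegraf has the health output plugin enabled and listens on
# this address. Host and port are placeholders, not values from our setup.
HEALTH_URL = "http://localhost:8080"


def status_to_result(status):
    """Map the HTTP status of the health endpoint to a pipeline result."""
    return "healthy" if status == 200 else "unhealthy"


def fetch_status(url=HEALTH_URL, timeout=5):
    """Return the HTTP status code of the endpoint, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Non-2xx answers arrive as exceptions, e.g. while a check fails.
        return error.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None
```

A CI job could then call `status_to_result(fetch_status())` and fail only on "unhealthy", leaving the definition of exceptions to the checks configured in telegraf itself.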

#10

Updated by okurz 7 months ago

Parent task set to #161414

please put the "hack week project ideas" and "health-output-plugin" as new tickets within the parent #161414

#11

Updated by nicksinger 7 months ago

Copied to action #168145: implement telegraf health check and adjust according pipelines added

Actions

Copy link

#12

Updated by nicksinger 7 months ago

Status changed from In Progress to Resolved

Created https://progress.opensuse.org/issues/168148 and https://progress.opensuse.org/issues/168145 - work here should be covered by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 and limiting the amount of total pipeline runs

#13

Updated by okurz 7 months ago

Status changed from Resolved to Feedback

The below seems to be related to your changes. I pushed changes to my fork in preparation for an MR, and then post-deploy checks were triggered and seemingly failed immediately. Note that the corresponding CI jobs in the MR were completely fine. Can you please look into that?

-------- Forwarded Message --------
Subject: salt-pillars-openqa | Failed pipeline for feature/cc_areas | 2ed63adf
Date: Sat, 12 Oct 2024 12:13:08 +0000
From: GitLab@SUSE gitlab@suse.de
Reply-To: GitLab@SUSE gitlab@suse.de
To: okurz@suse.de

salt-pillars-openqa | Failed pipeline for feature/cc_areas | 2ed63adf
GitLab
✖ Pipeline #1357922 has failed!

Project Oliver Kurz https://gitlab.suse.de/okurz / salt-pillars-openqa https://gitlab.suse.de/okurz/salt-pillars-openqa
Branch
feature/cc_areas https://gitlab.suse.de/okurz/salt-pillars-openqa/-/commits/feature/cc_areas

Commit
2ed63adf https://gitlab.suse.de/okurz/salt-pillars-openqa/-/commit/2ed63adfd0f326010f0c4a9f7532a111abc9e4d2

Add explicit "zone-cc" for all workers in commo...
Commit Author
Oliver Kurz https://gitlab.suse.de/okurz

Pipeline #1357922 https://gitlab.suse.de/okurz/salt-pillars-openqa/-/pipelines/1357922 triggered by Oliver Kurz https://gitlab.suse.de/okurz

had 1 failed job
Failed job
✖ post-deploy

telegraf <https://gitlab.suse.de/okurz/salt-pillars-openqa/-/jobs/3225392>

#14

Updated by jbaier_cz 7 months ago

The post-deploy telegraf job is already using the rules section (so the definition from defaults is overridden); https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L89 needs to be updated with the extra rules as well.

#15

Updated by okurz 7 months ago

Status changed from Feedback to Workable

same problem as I observed in my MRs is also in the MR by gpathak https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/918

#16

Updated by nicksinger 7 months ago

Status changed from Workable to Feedback

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919

#17

Updated by nicksinger 7 months ago

Status changed from Feedback to In Progress

nicksinger wrote in #note-16:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919

found another issue in my own MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1288 - not sure why it got scheduled there…

#18

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919 merged

#19

Updated by nicksinger 7 months ago

Status changed from In Progress to Feedback

I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.

#20

Updated by livdywan 7 months ago

Due date changed from 2024-10-16 to 2024-10-25

nicksinger wrote in #note-19:

I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.

Bumping the due date accordingly

#21

Updated by livdywan 6 months ago

Has duplicate action #168475: salt-states-openqa telegraf pipeline failing with error in libcrypto added

#22

Updated by nicksinger 6 months ago

Due date changed from 2024-10-25 to 2024-11-01

livdywan wrote in #note-20:

nicksinger wrote in #note-19:

I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.

Bumping the due date accordingly

Had no chance yet to talk with Jan about this so bumping again by another week.

#23

Updated by livdywan 6 months ago

okurz wrote in #note-18:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919 merged

Reading the upstream docs on rules again, I noticed:

if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH: If changes are pushed to the default branch. Use when you want to have the same configuration in multiple projects with different default branches.

The MR uses $CI_COMMIT_REF_NAME. Then again, the docs on predefined variables seem to confirm why we need it:

CI_COMMIT_BRANCH (Pre-pipeline): The commit branch name. Available in branch pipelines, including pipelines for the default branch. Not available in merge request pipelines or tag pipelines.
CI_COMMIT_REF_NAME (Pre-pipeline): The branch or tag name for which project is built.

Which seems like it should be correct but I can't find an example of our use case.
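
To make the comparison concrete, here is a small Python model of the two rule variants. It is not GitLab code and only encodes what the quoted docs say, plus two labeled assumptions: an unset variable compares like an empty string, and in merge request pipelines CI_COMMIT_REF_NAME holds the source branch name. The variable values and the helper `matches_default_branch` are hypothetical.

```python
def matches_default_branch(variables, name):
    """Model the rule `if: $<name> == $CI_DEFAULT_BRANCH`.

    Assumption: an unset variable compares like an empty string.
    """
    return variables.get(name, "") == variables["CI_DEFAULT_BRANCH"]


# Hypothetical variable sets for three pipeline types; values are illustrative.
PIPELINES = {
    "default branch": {
        "CI_DEFAULT_BRANCH": "master",
        "CI_COMMIT_BRANCH": "master",
        "CI_COMMIT_REF_NAME": "master",
    },
    # Per the quoted docs, CI_COMMIT_BRANCH is not available here.
    # Assumption: CI_COMMIT_REF_NAME holds the source branch name.
    "merge request": {
        "CI_DEFAULT_BRANCH": "master",
        "CI_COMMIT_REF_NAME": "feature/cc_areas",
    },
    # Per the quoted docs, CI_COMMIT_BRANCH is not available here either.
    "tag": {
        "CI_DEFAULT_BRANCH": "master",
        "CI_COMMIT_REF_NAME": "v1.0",
    },
}

for kind, variables in PIPELINES.items():
    by_branch = matches_default_branch(variables, "CI_COMMIT_BRANCH")
    by_ref = matches_default_branch(variables, "CI_COMMIT_REF_NAME")
    print(f"{kind}: CI_COMMIT_BRANCH={by_branch} CI_COMMIT_REF_NAME={by_ref}")
```

In this model both comparisons agree for all three pipeline types. They would only differ for a tag that carries the same name as the default branch, where only the CI_COMMIT_REF_NAME variant would match.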

#24

Updated by livdywan 6 months ago

Due date changed from 2024-11-01 to 2024-11-08

Bumping to account for availability.

#25

Updated by livdywan 6 months ago

Due date changed from 2024-11-08 to 2024-11-15
Priority changed from Normal to High

Perhaps we can wrap this up this week? Hopefully everyone involved is available now.

#26

Updated by nicksinger 6 months ago

livdywan wrote in #note-25:

Perhaps we can wrap this up this week? Hopefully everyone involved is available now.

I had a lengthy discussion with @jbaier_cz and he greatly helped me coming up with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1305 - of course we will only know after a merge if we really cover everything now.

#27

Updated by nicksinger 6 months ago

Status changed from Feedback to Resolved

nicksinger wrote in #note-26:

I had a lengthy discussion with @jbaier_cz and he greatly helped me coming up with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1305 - of course we will only know after a merge if we really cover everything now.

Merged. I tested with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1307 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/930 and they at least don't explode immediately. Also https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1308 looks good, but I don't want to merge a useless commit (and remove it again) just to test a schedule. At least the previous issue of having jobs on forks is gone now.

#28

Updated by okurz 6 months ago

Due date deleted (~~2024-11-15~~)
