action #167051
closedcoordination #161414: [epic] Improved salt based infrastructure management
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S
0%
Description
Observation
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145
monitor.qe.nue2.suse.org:
2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
telegraf errors
"systemctl status telegraf" on monitor says:
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
Docs: https://github.com/influxdata/telegraf
Main PID: 1481 (telegraf)
Tasks: 21 (limit: 4915)
CPU: 8h 20min 48.515s
CGroup: /system.slice/telegraf.service
├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
└─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session
Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>
Acceptance criteria
- AC1: Significant reduction in errors in our CI pipelines
- AC2: Errors in business related tooling are still visible somewhere
Suggestions
- Look into the influxdb connection error primarily. If it does not reproduce anymore and there are no further mentions in logs then no further action is needed. DONE: journal output unrelated to pipeline result, most likely temporary outage
- Consider separating reporting about low level monitoring from business related tooling. At the very least adjust the grep
- Report separate tickets about problems in business scripts. DONE: not applicable here, mentioned logs hint to a general network issue, currently all scripts return with 0
- Consider splitting out non-critical parts into a different conf file in /etc/telegraf.d/ and see if only the relevant ones are successful, or a separate config for business scripts with separate telegraf service invocation, separate log or journal target. DONE: external scripts are already split out
Updated by okurz about 2 months ago
- Subject changed from https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by openqa_review about 2 months ago
- Due date set to 2024-10-16
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger about 2 months ago
The problematic action triggering the failed pipeline is https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L82 - we run telegraf with "--test" which according to the documentation:
enable test mode: gather metrics, print them out, and exit. Note: Test mode only runs inputs, not processors, aggregators, or outputs
so the influx errors we see in the journal are related to runtime and not the pipeline. Regarding the execution errors: I think telegraf is handling this fine. It fails with a corresponding log message and continues operation. We introduced https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/805 to resolve https://progress.opensuse.org/issues/125765. In my opinion we should move this check for empty/missing slo data into grafana. Having telegraf properly running after deployment should be covered by https://progress.opensuse.org/issues/167728
Updated by nicksinger about 2 months ago
- Related to action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S added
Updated by nicksinger about 1 month ago
Discussed with @okurz in jitsi how to continue here and we noted down some points together (either precursor or follow-up tasks or part of the ticket):
- Hackweek project log aggregation so that we can alert ourselves on errors in logs properly and consistently for all relevant services
- Split telegraf between (between what? -> research) to avoid mixing runtime/critical errors. This would also allow us to treat all errors as critical in a critical domain and be lenient in business process related services
- Consider gathering telegraf log errors (could be to execute "telegraf -test" also periodically, e.g. in a pipeline schedule in https://gitlab.suse.de/openqa/salt-states-openqa/-/pipeline_schedules or logwarn) so that we can see telegraf errors that are not related to merge requests and are not confused by failing CI checks in open merge requests about unrelated problems
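The periodic run from the last point could look roughly like this in GitLab CI. This is only a sketch: the job name is hypothetical, and the script line assumes the job runs somewhere telegraf and its config are available (paths taken from the service status above), which may not match how deploy.yml actually invokes it.

```yaml
# Hypothetical job, not taken from deploy.yml
telegraf-scheduled:
  stage: post-deploy
  rules:
    # Run only when the pipeline was started by a pipeline schedule
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    # Test mode gathers metrics from the inputs once, prints them and exits
    - telegraf --test --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d
```

A job like this would surface input errors independently of merge requests, so failures there would not be mistaken for problems caused by an open MR.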
Updated by nicksinger about 1 month ago
Despite the more general discussion, I think the easiest approach for now is to only run this step if actual telegraf configs were changed. I tried to add this with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 but fail to understand the mentioned error:
jobs:telegraf config key may not be used with `rules`: only
this might be related to https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L17-19 (which apparently is deprecated?). But I also failed to quickly convert the logic behind these two lines into a rules-statement.
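For context, GitLab rejects a job that combines `rules` with `only` or `except`, which matches the error above. A minimal sketch of expressing both conditions purely with `rules` follows. The project path check and the `telegraf/**/*` glob are assumptions for illustration, since the actual lines 17-19 of deploy.yml and the config layout are not shown in this ticket.

```yaml
telegraf:
  stage: post-deploy
  rules:
    # Both parts of one rule must hold: default branch of the upstream
    # project (assumed intent of the old "only") and a change below the
    # hypothetical telegraf/ directory
    - if: $CI_PROJECT_PATH == "openqa/salt-states-openqa" && $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      changes:
        - telegraf/**/*
  script:
    - echo "placeholder for the telegraf --test invocation"
```

Checking the project path in the rule is one way to keep the job from running on forks. Whether that matches the final fix in the merge requests is not confirmed here.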
Updated by nicksinger about 1 month ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 to limit telegraf runs to only changes to its config. I was also reading https://www.influxdata.com/blog/telegraf-best-practices/ which mentions running multiple instances for stability, but mainly to avoid losing all data (which we currently don't cover and never saw huge problems with). It also doesn't really go into detail on how to handle warnings/errors in the logs. But they mention the https://github.com/influxdata/telegraf/tree/master/plugins/outputs/health#health-output-plugin which seems very powerful and would allow us to request a status in pipelines while also defining exceptions. I will try to implement a simple example.
Updated by okurz about 1 month ago
- Parent task set to #161414
please put the "hack week project ideas" and "health-output-plugin" as new tickets within the parent #161414
Updated by nicksinger about 1 month ago
- Copied to action #168145: implement telegraf health check and adjust according pipelines added
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Resolved
Created https://progress.opensuse.org/issues/168148 and https://progress.opensuse.org/issues/168145 - work here should be covered by https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 and limiting the amount of total pipeline runs
Updated by okurz about 1 month ago
- Status changed from Resolved to Feedback
The below seems to be related to your changes. I pushed changes to my fork in preparation for an MR, and then post-deploy checks were triggered and seemingly failed immediately. Note that the corresponding CI jobs in the MR were completely fine. Can you please look into that?
-------- Forwarded Message --------
Subject: salt-pillars-openqa | Failed pipeline for feature/cc_areas | 2ed63adf
Date: Sat, 12 Oct 2024 12:13:08 +0000
From: GitLab@SUSE gitlab@suse.de
Reply-To: GitLab@SUSE gitlab@suse.de
To: okurz@suse.de
✖ Pipeline #1357922 has failed!
Project Oliver Kurz https://gitlab.suse.de/okurz / salt-pillars-openqa https://gitlab.suse.de/okurz/salt-pillars-openqa
Branch
feature/cc_areas https://gitlab.suse.de/okurz/salt-pillars-openqa/-/commits/feature/cc_areas
Commit
2ed63adf https://gitlab.suse.de/okurz/salt-pillars-openqa/-/commit/2ed63adfd0f326010f0c4a9f7532a111abc9e4d2
Add explicit "zone-cc" for all workers in commo...
Commit Author
Oliver Kurz https://gitlab.suse.de/okurz
Pipeline #1357922 https://gitlab.suse.de/okurz/salt-pillars-openqa/-/pipelines/1357922 triggered by Oliver Kurz https://gitlab.suse.de/okurz
had 1 failed job
Failed job
✖ post-deploy
telegraf <https://gitlab.suse.de/okurz/salt-pillars-openqa/-/jobs/3225392>
Updated by jbaier_cz about 1 month ago
The post-deploy telegraf job is already using the rules section (so the definition from defaults is overridden); https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L89 needs to be updated with the extra rules as well.
Updated by okurz about 1 month ago
- Status changed from Feedback to Workable
The same problem I observed in my MRs also shows up in the MR by gpathak https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/918
Updated by nicksinger about 1 month ago
- Status changed from Workable to Feedback
Updated by nicksinger about 1 month ago
- Status changed from Feedback to In Progress
nicksinger wrote in #note-16:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919
found another issue in my own MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1288 - not sure why it got scheduled there…
Updated by okurz about 1 month ago
Updated by nicksinger about 1 month ago
- Status changed from In Progress to Feedback
I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.
Updated by livdywan about 1 month ago
- Due date changed from 2024-10-16 to 2024-10-25
nicksinger wrote in #note-19:
I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.
Bumping the due date accordingly
Updated by livdywan about 1 month ago
- Has duplicate action #168475: salt-states-openqa telegraf pipeline failing with error in libcrypto added
Updated by nicksinger 27 days ago
- Due date changed from 2024-10-25 to 2024-11-01
livdywan wrote in #note-20:
nicksinger wrote in #note-19:
I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.
Bumping the due date accordingly
Had no chance yet to talk with Jan about this so bumping again by another week.
Updated by livdywan 27 days ago
okurz wrote in #note-18:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919 merged
Reading the upstream docs on rules again, I noticed:
if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH: If changes are pushed to the default branch. Use when you want to have the same configuration in multiple projects with different default branches.
The MR uses $CI_COMMIT_REF_NAME. Then again, the docs on predefined variables seem to confirm why we need it:
CI_COMMIT_BRANCH Pre-pipeline The commit branch name. Available in branch pipelines, including pipelines for the default branch. Not available in merge request pipelines or tag pipelines.
CI_COMMIT_REF_NAME Pre-pipeline The branch or tag name for which project is built.
Which seems like it should be correct but I can't find an example of our use case.
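To make the difference concrete, here is a sketch of the two variants side by side. Job names are hypothetical and the scripts are placeholders. The comments only restate the variable descriptions quoted above.

```yaml
uses-commit-branch:
  rules:
    # CI_COMMIT_BRANCH is not available in merge request or tag pipelines,
    # so this rule can only match in branch pipelines
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - echo "branch pipeline on the default branch"

uses-commit-ref-name:
  rules:
    # CI_COMMIT_REF_NAME is the branch or tag name the project is built for,
    # so it is also set in pipeline types where CI_COMMIT_BRANCH is missing
    - if: $CI_COMMIT_REF_NAME == $CI_DEFAULT_BRANCH
  script:
    - echo "ref name equals the default branch name"
```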
Updated by nicksinger 7 days ago
livdywan wrote in #note-25:
Perhaps we can wrap this up this week? Hopefully everyone involved is available now.
I had a lengthy discussion with @jbaier_cz and he greatly helped me come up with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1305 - of course we will only know after a merge if we really cover everything now.
Updated by nicksinger 7 days ago
- Status changed from Feedback to Resolved
nicksinger wrote in #note-26:
I had a lengthy discussion with @jbaier_cz and he greatly helped me come up with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1305 - of course we will only know after a merge if we really cover everything now.
Merged. I tested with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1307 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/930 and they at least don't explode immediately. Also https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1308 looks good but I don't want to merge a useless commit (and remove it again) just to test a schedule. At least the previous issue of having jobs on forks is gone now.