action #167051

closed

coordination #161414: [epic] Improved salt based infrastructure management

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S

Added by okurz 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Start date:
2024-09-19
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145

monitor.qe.nue2.suse.org:
    2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most recent call last):...
    2024-09-19T11:58:14Z E! [telegraf] Error running agent: input plugins recorded 1 errors
    telegraf errors

systemctl status telegraf on monitor says

● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/etc/systemd/system/telegraf.service; enabled; preset: disabled)
     Active: active (running) since Sun 2024-09-01 03:31:25 CEST; 2 weeks 4 days ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 1481 (telegraf)
      Tasks: 21 (limit: 4915)
        CPU: 8h 20min 48.515s
     CGroup: /system.slice/telegraf.service
             ├─1481 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
             └─1697 /usr/bin/dbus-daemon --syslog --fork --print-pid 4 --print-address 6 --session

Sep 19 13:00:20 monitor telegraf[1481]: 2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 13:12:38 monitor telegraf[1481]: 2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:44 monitor telegraf[1481]: 2024-09-19T11:12:44Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:48Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
Sep 19 13:12:54 monitor telegraf[1481]: 2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [outputs.influxdb] When writing to [http://openqa-monitor.qa.suse.de:8086]: failed doing req: Post "http://openqa-monitor.qa.suse.de:808>
Sep 19 13:14:23 monitor telegraf[1481]: 2024-09-19T11:14:23Z E! [agent] Error writing to outputs.influxdb: could not write any address
Sep 19 14:00:01 monitor telegraf[1481]: 2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/maintenance_queue_monitor.py": Traceback (most rec>
Sep 19 14:00:08 monitor telegraf[1481]: 2024-09-19T12:00:08Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 400 Bad Request): partial write: points beyond retenti>
Sep 19 14:00:31 monitor telegraf[1481]: 2024-09-19T12:00:31Z E! [inputs.exec] Error in plugin: exec: exit status 1 for command "/etc/telegraf/scripts/collect_sleperf_test.py": Traceback (most recent c>

Acceptance criteria

  • AC1: Significant reduction in errors in our CI pipelines
  • AC2: Errors in business related tooling are still visible somewhere

Suggestions

  • Look into the influxdb connection error primarily. If it does not reproduce anymore and there are no further mentions in the logs, then no further action is needed. DONE: journal output is unrelated to the pipeline result, most likely a temporary outage
  • Consider separating reporting about low-level monitoring from business-related tooling. At the very least, adjust the grep
  • Report separate tickets about problems in business scripts. DONE: not applicable here, the mentioned logs hint at a general network issue, currently all scripts return with 0
  • Consider splitting out non-critical parts into a different conf file in /etc/telegraf.d/ and see if only the relevant ones are successful, or use a separate config for business scripts with a separate telegraf service invocation and a separate log or journal target. DONE: external scripts are already split out

Related issues 3 (1 open, 2 closed)

Related to openQA Infrastructure (public) - action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S (Resolved, gpathak, 2024-10-02)
Has duplicate openQA Infrastructure (public) - action #168475: salt-states-openqa telegraf pipeline failing with error in libcrypto (Rejected, livdywan)
Copied to openQA Infrastructure (public) - action #168145: implement telegraf health check and adjust according pipelines (New)

Actions #1

Updated by okurz 3 months ago

  • Subject changed from https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de to https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/3109145 failed due to telegraf errors on monitor.qa.suse.de size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #2

Updated by nicksinger 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #3

Updated by openqa_review 3 months ago

  • Due date set to 2024-10-16

Setting due date based on mean cycle time of SUSE QE Tools

Actions #4

Updated by nicksinger 3 months ago

The problematic action triggering the failed pipeline is https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L82: we run telegraf with "--test", which according to the documentation does the following:

enable test mode: gather metrics, print them out, and exit. Note: Test mode only runs inputs, not processors, aggregators, or outputs

So the influx errors we see in the journal are related to runtime and not to the pipeline. Regarding the execution errors, I think telegraf is handling this fine: it fails with an according log message and continues operation. We introduced https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/805 to resolve https://progress.opensuse.org/issues/125765. In my opinion we should move this check for empty/missing slo data into grafana. Having telegraf properly running after deployment should be covered by https://progress.opensuse.org/issues/167728
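
To illustrate the distinction made above, here is a minimal sketch (not the check used in deploy.yml) that sorts error lines by the plugin named in brackets. The helper names and the shortened sample lines are hypothetical; the log format is taken from the journal excerpt in the description, and the rule "only inputs.* errors matter in test mode" is an assumption derived from the quoted documentation.

```python
import re

# Matches the level marker and bracketed source of a telegraf log line, e.g.
#   2024-09-19T11:57:59Z E! [inputs.exec] Error in plugin: ...
LOG_LINE = re.compile(r"(?P<level>[EW])! \[(?P<source>[^\]]+)\]")


def error_sources(lines):
    """Return the bracketed source of every error-level ("E!") line."""
    sources = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("level") == "E":
            sources.append(match.group("source"))
    return sources


def relevant_in_test_mode(source):
    """Assumption: test mode only runs inputs, so only inputs.* can fail there."""
    return source.startswith("inputs.")


# Shortened sample lines based on the journal excerpt in the description.
journal = [
    '2024-09-19T11:00:20Z E! [inputs.exec] Error in plugin: exec: exit status 1',
    '2024-09-19T11:12:38Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval',
    '2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address',
    '2024-09-19T12:00:08Z E! [outputs.influxdb] Failed to write metric',
]

for source in error_sources(journal):
    kind = "pipeline-relevant" if relevant_in_test_mode(source) else "runtime only"
    print(f"{source}: {kind}")
```

With these sample lines only the inputs.exec error would count for a test-mode run; the agent and outputs.influxdb errors belong to the long-running service.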

Actions #5

Updated by nicksinger 3 months ago

  • Related to action #167728: grafana dashboard for monitor.qe.nue2.suse.org size:S added
Actions #6

Updated by nicksinger 3 months ago

  • Description updated (diff)
Actions #7

Updated by nicksinger 2 months ago

Discussed with @okurz in jitsi how to continue here and we noted down some points together (either precursor or follow-up tasks or part of the ticket):

  1. Hackweek project: log aggregation, so that we can alert ourselves on errors in logs properly and consistently for all relevant services
  2. Split telegraf between (between what? -> research) to avoid mixing runtime/critical errors. This would also allow us to treat all errors as critical in a critical domain and be lenient in business process related services
  3. Consider gathering telegraf log errors (e.g. by also executing telegraf -test periodically in a pipeline schedule in https://gitlab.suse.de/openqa/salt-states-openqa/-/pipeline_schedules, or via logwarn) so that we can see telegraf errors that are not related to merge requests and are not confused by failing CI checks in open merge requests caused by unrelated problems
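
As a rough sketch of point 2 (strict in the critical domain, lenient for business-related services), a check could fail only on errors that do not come from business scripts while still printing the others, so that they stay visible as AC2 asks. The script list, the function names and the classification itself are assumptions for illustration; only the two script paths appear in the journal output above.

```python
# Hypothetical list of scripts considered business-related; the paths are the
# two scripts seen in the journal, the classification is an assumption.
BUSINESS_SCRIPTS = (
    "/etc/telegraf/scripts/maintenance_queue_monitor.py",
    "/etc/telegraf/scripts/collect_sleperf_test.py",
)


def split_errors(lines):
    """Split error-level lines into (critical, business) lists."""
    critical, business = [], []
    for line in lines:
        if " E! " not in line:
            continue  # not an error-level line
        if any(script in line for script in BUSINESS_SCRIPTS):
            business.append(line)
        else:
            critical.append(line)
    return critical, business


def exit_code(lines):
    """Return 1 if any critical error exists; business errors are only reported."""
    critical, business = split_errors(lines)
    for line in business:
        print("business-related (not failing the check):", line)
    for line in critical:
        print("critical:", line)
    return 1 if critical else 0


business_only = [
    '2024-09-19T12:00:01Z E! [inputs.exec] Error in plugin: exec: exit status 1 '
    'for command "/etc/telegraf/scripts/maintenance_queue_monitor.py"',
]
with_critical = business_only + [
    '2024-09-19T11:12:54Z E! [agent] Error writing to outputs.influxdb: could not write any address',
]

print(exit_code(business_only))
print(exit_code(with_critical))
```

The design choice here is that business errors never change the exit code but are always printed, so they can still be picked up by a separate report or alert.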
Actions #8

Updated by nicksinger 2 months ago

Despite the more general discussion, I think the easiest approach currently is to run this step only if the actual telegraf configs were changed. I tried to add this with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 but fail to understand the reported error:

    jobs:telegraf config key may not be used with `rules`: only

This might be related to https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L17-19 (which apparently is deprecated?). However, I also failed to quickly convert the logic behind these two lines into a rules statement.

Actions #9

Updated by nicksinger 2 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1284 limits telegraf runs to changes to its config. I was also reading https://www.influxdata.com/blog/telegraf-best-practices/ which mentions multiple instances regarding stability, but mainly to avoid losing all data (which we currently don't cover and never saw huge problems with). It also doesn't really go into detail on how to handle warnings/errors in the logs. But they mention the https://github.com/influxdata/telegraf/tree/master/plugins/outputs/health#health-output-plugin which seems very powerful and would allow us to request a status in pipelines while also defining exceptions. I will try to implement a simple example.

Actions #10

Updated by okurz 2 months ago

  • Parent task set to #161414

please put the "hack week project ideas" and "health-output-plugin" as new tickets within the parent #161414

Actions #11

Updated by nicksinger 2 months ago

  • Copied to action #168145: implement telegraf health check and adjust according pipelines added
Actions #12

Updated by nicksinger 2 months ago

  • Status changed from In Progress to Resolved
Actions #13

Updated by okurz 2 months ago

  • Status changed from Resolved to Feedback

The below seems to be related to your changes. I pushed changes to my fork in preparation for an MR, and then post-deploy checks were triggered and seemingly failed immediately. Note that the according CI jobs in the MR were completely fine. Can you please look into that?

-------- Forwarded Message --------
Subject: salt-pillars-openqa | Failed pipeline for feature/cc_areas | 2ed63adf
Date: Sat, 12 Oct 2024 12:13:08 +0000
From: GitLab@SUSE gitlab@suse.de
Reply-To: GitLab@SUSE gitlab@suse.de
To: okurz@suse.de

✖ Pipeline #1357922 has failed!

Project: Oliver Kurz (https://gitlab.suse.de/okurz) / salt-pillars-openqa (https://gitlab.suse.de/okurz/salt-pillars-openqa)
Branch: feature/cc_areas (https://gitlab.suse.de/okurz/salt-pillars-openqa/-/commits/feature/cc_areas)
Commit: 2ed63adf (https://gitlab.suse.de/okurz/salt-pillars-openqa/-/commit/2ed63adfd0f326010f0c4a9f7532a111abc9e4d2) Add explicit "zone-cc" for all workers in commo...
Commit Author: Oliver Kurz (https://gitlab.suse.de/okurz)

Pipeline #1357922 (https://gitlab.suse.de/okurz/salt-pillars-openqa/-/pipelines/1357922) triggered by Oliver Kurz (https://gitlab.suse.de/okurz) had 1 failed job

Failed job: ✖ post-deploy telegraf <https://gitlab.suse.de/okurz/salt-pillars-openqa/-/jobs/3225392>
Actions #14

Updated by jbaier_cz 2 months ago

The post-deploy telegraf job is already using the rules section (so the definition from defaults is overridden), so https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/deploy.yml?ref_type=heads#L89 needs to be updated with the extra rules as well.

Actions #15

Updated by okurz 2 months ago

  • Status changed from Feedback to Workable

The same problem that I observed in my MRs also occurs in the MR by gpathak: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/918

Actions #16

Updated by nicksinger 2 months ago

  • Status changed from Workable to Feedback
Actions #17

Updated by nicksinger 2 months ago

  • Status changed from Feedback to In Progress

nicksinger wrote in #note-16:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919

found another issue in my own MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1288 - not sure why it got scheduled there…

Actions #19

Updated by nicksinger 2 months ago

  • Status changed from In Progress to Feedback

I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.

Actions #20

Updated by livdywan 2 months ago

  • Due date changed from 2024-10-16 to 2024-10-25

nicksinger wrote in #note-19:

I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.

Bumping the due date accordingly

Actions #21

Updated by livdywan 2 months ago

  • Has duplicate action #168475: salt-states-openqa telegraf pipeline failing with error in libcrypto added
Actions #22

Updated by nicksinger about 2 months ago

  • Due date changed from 2024-10-25 to 2024-11-01

livdywan wrote in #note-20:

nicksinger wrote in #note-19:

I need more help from @jbaier_cz to get this properly working across all repositories/situations. Will continue after his vacation.

Bumping the due date accordingly

Had no chance yet to talk with Jan about this so bumping again by another week.

Actions #23

Updated by livdywan about 2 months ago

okurz wrote in #note-18:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/919 merged

Reading the upstream docs on rules again, I noticed:

if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH: If changes are pushed to the default branch. Use when you want to have the same configuration in multiple projects with different default branches.

The MR uses $CI_COMMIT_REF_NAME. Then again, the docs on predefined variables seem to confirm why we need it:

CI_COMMIT_BRANCH (Pre-pipeline): The commit branch name. Available in branch pipelines, including pipelines for the default branch. Not available in merge request pipelines or tag pipelines.
CI_COMMIT_REF_NAME (Pre-pipeline): The branch or tag name for which project is built.

Which seems like it should be correct but I can't find an example of our use case.
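
To make the difference between the two variables concrete, here is a small model of the quoted docs. It is only a sketch: the default branch name "master", the example branch and the variable sets per pipeline type are assumptions, and the functions merely imitate how an unset variable would fail an equality check in a rules condition.

```python
DEFAULT_BRANCH = "master"  # assumption: default branch of the repositories

# Hypothetical variable sets per pipeline type, following the quoted docs:
# CI_COMMIT_BRANCH is not available in merge request pipelines.
PIPELINES = {
    "branch pipeline on default branch": {
        "CI_COMMIT_BRANCH": "master",
        "CI_COMMIT_REF_NAME": "master",
    },
    "branch pipeline on feature branch": {
        "CI_COMMIT_BRANCH": "feature/cc_areas",
        "CI_COMMIT_REF_NAME": "feature/cc_areas",
    },
    "merge request pipeline": {
        # Assumption: the ref name is the source branch of the MR.
        "CI_COMMIT_REF_NAME": "feature/cc_areas",
    },
}


def is_set(variables, name):
    """Return True if the variable exists in this pipeline type."""
    return name in variables


def is_default_branch(variables, name):
    """Model `$<name> == $CI_DEFAULT_BRANCH`; an unset variable never matches."""
    return variables.get(name) == DEFAULT_BRANCH


for kind, variables in PIPELINES.items():
    print(
        f"{kind}: CI_COMMIT_BRANCH set={is_set(variables, 'CI_COMMIT_BRANCH')}, "
        f"default via REF_NAME={is_default_branch(variables, 'CI_COMMIT_REF_NAME')}"
    )
```

In this model both variables agree for branch pipelines, while only CI_COMMIT_REF_NAME carries a value in the merge request case.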

Actions #24

Updated by livdywan about 2 months ago

  • Due date changed from 2024-11-01 to 2024-11-08

Bumping to account for availability.

Actions #25

Updated by livdywan about 1 month ago

  • Due date changed from 2024-11-08 to 2024-11-15
  • Priority changed from Normal to High

Perhaps we can wrap this up this week? Hopefully everyone involved is available now.

Actions #26

Updated by nicksinger about 1 month ago

livdywan wrote in #note-25:

Perhaps we can wrap this up this week? Hopefully everyone involved is available now.

I had a lengthy discussion with @jbaier_cz and he greatly helped me come up with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1305. Of course, we will only know after a merge if we really cover everything now.

Actions #27

Updated by nicksinger about 1 month ago

  • Status changed from Feedback to Resolved

nicksinger wrote in #note-26:

I had a lengthy discussion with @jbaier_cz and he greatly helped me coming up with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1305 - of course we will only know after a merge if we really cover everything now.

Merged. I tested with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1307 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/930 and they at least don't explode immediately. Also, https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1308 looks good, but I don't want to merge a useless commit (and remove it again) just to test a schedule. At least the previous issue of having jobs on forks is gone now.

Actions #28

Updated by okurz about 1 month ago

  • Due date deleted (2024-11-15)