action #90968: [alert] Multiple flaky incomplete job alerts on Sunday - openQA Infrastructure (public) - openSUSE Project Management Tool

action #90968

closed

[alert] Multiple flaky incomplete job alerts on Sunday

Added by livdywan over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category:
Target version: openQA Project (public) - Ready
Start date: 2021-04-12
Due date: 2021-04-29
% Done:
Estimated time:

Description

Incomplete jobs (not restarted) of last 24h alert - Ok after 2 minutes

Queue: State (SUSE) alert* - OK after 3 minutes

Error message

tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+mean%28%22scheduled%22%29+FROM+%22openqa_jobs%22+WHERE+%22url%22+%3D+%27https%3A%2F%2Fopenqa.suse.de%27+AND+time+%3E+now%28%29+-+1m+GROUP+BY+time%2840s%29+fill%28null%29": dial tcp [::1]:8086: connect: connection refused

New incompletes alert - OK after 3 minutes

Error message

tsdb.HandleRequest() error Get "http://localhost:8086/query?db=telegraf&epoch=s&q=SELECT+non_negative_difference%28distinct%28%22incompletes_last_24h%22%29%29+FROM+%22postgresql%22+WHERE+time+%3E+now%28%29+-+1m+GROUP+BY+time%2850ms%29": dial tcp [::1]:8086: connect: connection refused

Updated by okurz over 3 years ago

Priority changed from Normal to High
Target version set to Ready

Updated by livdywan over 3 years ago

Description updated (diff)

Updated by okurz over 3 years ago

Subject changed from Multiple flaky incomplete job alerts on Sunday to [alert] Multiple flaky incomplete job alerts on Sunday
Status changed from New to Workable

Updated by mkittler over 3 years ago

This was caused by InfluxDB restarting briefly, not by a high number of incompletes:

martchus@openqa-monitor:~> systemctl status influxdb.service 
● influxdb.service - InfluxDB database server
   Loaded: loaded (/usr/lib/systemd/system/influxdb.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2021-04-11 03:30:36 CEST; 1 day 11h ago
 Main PID: 1668 (influxd)
    Tasks: 12
   CGroup: /system.slice/influxdb.service
           └─1668 /usr/bin/influxd -config /etc/influxdb/config.toml -pidfile /run/influxdb/influxdb.pid

The whole system was actually restarting:

martchus@openqa-monitor:~> sudo journalctl --system --boot
-- Logs begin at Wed 2021-04-07 18:19:11 CEST, end at Mon 2021-04-12 15:30:06 CEST. --
Apr 11 03:30:20 openqa-monitor kernel: Linux version 5.3.18-lp152.69-default (geeko@buildhost) (gcc version 7.5.0 (SUSE Linux)) #1 SMP Tue Apr 6 11:41:13 UTC 2021 (d532e33)

So it looks like InfluxDB just wasn't ready soon enough. Maybe we can just increase the grace period for InfluxDB being inaccessible.

Updated by mkittler over 3 years ago

Status changed from Workable to In Progress
Assignee set to mkittler

> Maybe we can just increase the grace period for InfluxDB being inaccessible.

There's not really a grace period for that, but I'll try to set the "No Data & Error Handling" option "If execution error or timeout" to "Keep last state". If InfluxDB is broken for longer, this should be caught by the failed systemd services alert anyway. I'll try to test whether that alert actually includes the monitoring host by intentionally failing a unit.

Updated by okurz over 3 years ago

As discussed, an alternative might be to override the Grafana service so that it first probes InfluxDB, waits until that probe succeeds, and only then executes the expected service. So far we seem to have seen this problem only when InfluxDB and Grafana start up, not at any time during normal execution.
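
A minimal sketch of what such a probe-then-start wrapper could look like. This is not taken from the ticket or the merge requests. The helper name `wait_for`, the attempt count and the delay are assumptions for illustration. The host and port come from the alert error messages above, and the sketch assumes an InfluxDB 1.x instance that answers on its `/ping` endpoint.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): retry a probe command until it
# succeeds or the attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # The probe's output is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example, commented out so the snippet has no side effects when sourced:
# probe InfluxDB on the host and port from the alert error messages and give
# up after 30 attempts, 2 seconds apart (both values are assumptions).
# wait_for 30 2 curl -sf http://localhost:8086/ping || exit 1
```

A wrapper like this would run before the Grafana service starts, so that alert queries are only evaluated once InfluxDB accepts connections. How it is hooked into the service is left open here, since the ticket does not say which mechanism was discussed.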

Updated by openqa_review over 3 years ago

Due date set to 2021-04-29

Setting due date based on mean cycle time of SUSE QE Tools

Updated by mkittler over 3 years ago

SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/479

Updated by okurz over 3 years ago

I will try https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/477 now. Be ready for any alarm explosion.

https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&editPanel=17&tab=alert states that it has had no data for 2 minutes; maybe the telegraf config is now broken.

Marius Kittler: Apr 15 16:32:26 openqa telegraf[17485]: 2021-04-15T14:32:26Z W! Telegraf is not permitted to read /etc/telegraf/telegraf.d
The directory is there. I've just restarted telegraf and now it seems to work. Maybe salt restarted telegraf and only then created the directory.

Oliver Kurz: I suspected the same. I just triggered a restart of all telegraf services on all salt nodes as well. I can easily add a restart trigger in salt as well.

Manually executing

telegraf --debug -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --test 2>&1 | grep '\(http_response\|systemd_failed\)'

looks good as well.

After some minutes it looks good again.

Updated by mkittler over 3 years ago

Status changed from In Progress to Feedback

My SR has been merged as well.

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

Crosschecked. The new dashboard https://monitor.qa.suse.de/d/EML0bpuGk/monitoring?orgId=1 is there and maintained in salt. All good then.
