action #94456: no data from any arm host on https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 - openQA Infrastructure (public) - openSUSE Project Management Tool

action #94456

closed

no data from any arm host on https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1

Added by okurz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Immediate
Assignee: mkittler
Category:
Target version: openQA Project (public) - Ready
Start date: 2021-06-22
Due date:
% Done:
Estimated time:
Tags: arm, alert, osd, monitoring, grafana, telegraf

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 has shown no data on any panel since 2021-06-20, i.e. we have received no monitoring data from these ARM hosts since that date.

Expected result

Sane data is back

Suggestions

Check availability of these hosts manually
Crosscheck if we should have received alerts from other sources, e.g. https://gitlab.suse.de/openqa/grafana-webhook-actions

Related issues 3 (0 open — 3 closed)

Updated by okurz over 3 years ago

I checked and at least openqaworker-arm-1 is up and running openQA jobs just fine. Maybe this is related to #93922#note-10

Updated by okurz over 3 years ago

Related to action #93922: grafana dashboard for "approximate result size by job group" fails to render any data with "InfluxDB Error: unsupported mean iterator type: *query.stringInterruptIterator" added

Updated by mkittler over 3 years ago

I've also noticed the lack of "ping data" on the new generic dashboards, see #91779#note-8.

Updated by mkittler over 3 years ago

martchus@openqa:~> sudo journalctl -fu telegraf
Jun 22 12:57:00 openqa telegraf[2034]: 2021-06-22T10:57:00Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host

But no errors about other hosts.
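
To crosscheck which ping targets fail name resolution, independently of telegraf, something like the following could be used. This is only a minimal sketch: the function name `unresolvable_hosts` and the injectable `resolve` parameter are assumptions for illustration, and of the host names in the usage comment only backup-vm appears in the log above.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> List[str]:
    """Return the hosts whose name lookup fails, in input order."""
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


# Hypothetical usage (performs real DNS lookups):
# print(unresolvable_hosts(["backup-vm", "openqaworker-arm-1"]))
```

Passing the resolver as a parameter keeps the check testable without network access.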

Updated by mkittler over 3 years ago

The data is also not available for OSD itself: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=76&orgId=1

Updated by okurz over 3 years ago

Priority changed from Urgent to Immediate

By now all three ARM workers are offline and are not recovered automatically: without data there is no alert, and without an alert nothing is triggered in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines . Bumping the priority. I have now manually triggered a power reset of openqaworker-arm-1 and openqaworker-arm-2; openqaworker-arm-3 is not reachable over IPMI. I do not know how to recover openqaworker-arm-3 other than by a manual reset by EngInfra. I thought we had found a way to reset the complete chassis in #89815, but I can't reach IPMI -> #94513
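
The chain described here (no data, hence no alert, hence no recovery trigger) can be broken by treating missing data itself as an alert condition. The following is a minimal sketch of such a staleness check. It is not what the Grafana alerts or grafana-webhook-actions actually do; the function name `stale_hosts`, the `last_seen` mapping and the threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def stale_hosts(
    last_seen: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return hosts with no data at all or with data older than max_age.

    last_seen maps a host name to the timestamp of its newest data point,
    or to None if the host never reported anything.
    """
    return sorted(
        host
        for host, seen in last_seen.items()
        if seen is None or now - seen > max_age
    )
```

A host returned by this check would then be handled like any other alerting host, so a worker that stops reporting still leads to a recovery attempt.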

Updated by okurz over 3 years ago

Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added

Updated by okurz over 3 years ago

Copied to action #94513: openqaworker-arm-3 not reachable and not recoverable over usual ways added

Updated by livdywan over 3 years ago

Maybe this is a regression caused by #91779#note-7, as the timing would suggest? In that case a follow-up to or a revert of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507 would be sensible. Maybe the {worker,node}names? change affects Jinja templates elsewhere?

#10

Updated by okurz over 3 years ago

Status changed from Workable to In Progress
Assignee set to mkittler

#11

Updated by okurz over 3 years ago

Added new wiki section with what we learned: https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Setup-guide-for-new-machines

#12

Updated by mkittler over 3 years ago

Status changed from In Progress to Resolved

The fix has been merged: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/512

I restarted telegraf to be sure that the new config is really applied and the ping data is still updated (after the manual fix). So this should be resolved.

#13

Updated by okurz over 3 years ago

Status changed from Resolved to Feedback

For an urgent or immediate ticket we should do more than just a single fix. Let's talk about it.

#14

Updated by okurz over 3 years ago

Status changed from Feedback to Resolved

We crosschecked and found that only the ARM workers were affected in a problematic way. We worked on the ticket within 12h of setting it to "Immediate" priority. We also extended the wiki. Reminder to everyone: Please make sure that urgent or immediate tickets are picked up as fast as possible, and get other people to help. Calling it "Resolved" now.
