action #94456

closed

no data from any arm host on https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1

Added by okurz over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
Start date:
2021-06-22
Due date:
% Done:

0%

Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 has shown no data on any panel since 2021-06-20, i.e. we have received no monitoring data from the ARM hosts since that date.

Expected result

  • Sane data is back

Suggestions


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #93922: grafana dashboard for "approximate result size by job group" fails to render any data with "InfluxDB Error: unsupported mean iterator type: *query.stringInterruptIterator" | Resolved | mkittler | 2021-06-11 | 2021-06-29

Related to openQA Infrastructure - action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically | Resolved | mkittler | 2021-03-10 | 2021-04-22

Copied to openQA Infrastructure - action #94513: openqaworker-arm-3 not reachable and not recoverable over usual ways | Resolved | okurz | 2021-06-22 | 2021-07-01

Actions #1

Updated by okurz over 3 years ago

I checked, and at least openqaworker-arm-1 is up and running openQA jobs just fine. Maybe this is related to #93922#note-10.

Actions #2

Updated by okurz over 3 years ago

  • Related to action #93922: grafana dashboard for "approximate result size by job group" fails to render any data with "InfluxDB Error: unsupported mean iterator type: *query.stringInterruptIterator" added
Actions #3

Updated by mkittler over 3 years ago

I've also noticed the lack of "ping data" on the new generic dashboards, see #91779#note-8.

Actions #4

Updated by mkittler over 3 years ago

martchus@openqa:~> sudo journalctl -fu telegraf
Jun 22 12:57:00 openqa telegraf[2034]: 2021-06-22T10:57:00Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host

But no errors about other hosts.
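
The telegraf log above points at a name-resolution failure for one ping target. A minimal sketch of how the configured ping targets could be cross-checked for resolvability follows. The function name `check_hosts` and the host list are hypothetical; only `backup-vm` and the `openqaworker-arm-*` names appear in this ticket, and the real list of ping targets lives in the telegraf configuration, which is not shown here.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def check_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each host to None if it resolves, otherwise to the error text."""
    results: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            resolve(host)
            results[host] = None
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[host] = str(exc)
    return results


if __name__ == "__main__":
    # Hypothetical target list; replace with the hosts from the ping input.
    targets = ["backup-vm", "openqaworker-arm-1", "openqaworker-arm-2", "openqaworker-arm-3"]
    for name, error in check_hosts(targets).items():
        print(f"{name}: {'ok' if error is None else 'FAILED: ' + error}")
```

Such a check only covers DNS lookups. It would not explain missing data for hosts that resolve correctly, which matches the observation that the log shows no errors about other hosts.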

Actions #6

Updated by okurz over 3 years ago

  • Priority changed from Urgent to Immediate

By now all three ARM workers are offline and are not automatically recovered: with no data there is no alert, and with no alert there is no trigger in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines . Bumping prio. I triggered a power reset of openqaworker-arm-1 and openqaworker-arm-2 manually just now; openqaworker-arm-3 is not reachable over IPMI. I do not know how to recover openqaworker-arm-3 other than by a manual reset by EngInfra. I thought we had found a way to reset the complete chassis in #89815, but I can't reach IPMI -> #94513
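
The failure mode described here (no data, hence no alert, hence no recovery trigger) can be illustrated with a small staleness check that treats missing data as an alerting state instead of silently staying quiet. This is only a sketch of the idea, not how the Grafana alerts in this setup are actually defined; the function name `host_state` and the 15-minute threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def host_state(
    last_seen: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),  # assumed threshold
) -> str:
    """Return "alerting" if data is missing or stale, otherwise "ok"."""
    if last_seen is None or now - last_seen > max_age:
        return "alerting"
    return "ok"


if __name__ == "__main__":
    now = datetime(2021, 6, 22, 12, 0, tzinfo=timezone.utc)
    # Last data point two days earlier, as in the observation of this ticket.
    print(host_state(datetime(2021, 6, 20, 12, 0, tzinfo=timezone.utc), now))
    # No data point at all.
    print(host_state(None, now))
```

With a rule of this shape, a host that stops reporting would end up in the same state as a host that reports a failed ping, so the automatic recovery would still be triggered.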

Actions #7

Updated by okurz over 3 years ago

  • Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added
Actions #8

Updated by okurz over 3 years ago

  • Copied to action #94513: openqaworker-arm-3 not reachable and not recoverable over usual ways added
Actions #9

Updated by livdywan over 3 years ago

Maybe this is a regression caused by #91779#note-7, as the timing would suggest? In that case a follow-up to or a revert of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507 would be sensible. Maybe the {worker,node}names? change affects Jinja templates elsewhere?

Actions #10

Updated by okurz over 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #12

Updated by mkittler over 3 years ago

  • Status changed from In Progress to Resolved

The fix has been merged: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/512

I restarted telegraf to make sure that the new config is really applied and that the ping data is still being updated (after the manual fix). So this should be resolved.

Actions #13

Updated by okurz over 3 years ago

  • Status changed from Resolved to Feedback

For an urgent or immediate ticket we should do more than just a single fix. Let's talk about it.

Actions #14

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

We crosschecked and found that only the ARM workers were hit in a problematic way. We worked on the ticket within 12h of setting it to "Immediate" priority. We also extended the wiki. Reminder to everyone: please make sure that urgent or immediate tickets are picked up as fast as possible, and get other people to help. Calling it "Resolved" now.
