action #94456

no data from any arm host on https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Immediate
Assignee:
Target version:
Start date: 2021-06-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 shows no data on any panel since 2021-06-20, i.e. we have received no monitoring data from the ARM hosts since that date.

Expected result

  • Sane data is back

Suggestions


Related issues

Related to openQA Infrastructure - action #93922: grafana dashboard for "approximate result size by job group" fails to render any data with "InfluxDB Error: unsupported mean iterator type: *query.stringInterruptIterator" | Resolved | 2021-06-11 | 2021-06-29

Related to openQA Infrastructure - action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically | Resolved | 2021-03-10 | 2021-04-22

Copied to openQA Infrastructure - action #94513: openqaworker-arm-3 not reachable and not recoverable over usual ways | Resolved | 2021-06-22 | 2021-07-01

History

#1 Updated by okurz 3 months ago

I checked and at least openqaworker-arm-1 is up and running openQA jobs just fine. Maybe this is related to #93922#note-10

#2 Updated by okurz 3 months ago

  • Related to action #93922: grafana dashboard for "approximate result size by job group" fails to render any data with "InfluxDB Error: unsupported mean iterator type: *query.stringInterruptIterator" added

#3 Updated by mkittler 3 months ago

I've also noticed the lack of "ping data" on the new generic dashboards, see #91779#note-8.

#4 Updated by mkittler 3 months ago

martchus@openqa:~> sudo journalctl -fu telegraf
Jun 22 12:57:00 openqa telegraf[2034]: 2021-06-22T10:57:00Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host

But no errors about other hosts.
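
To cross-check the ping targets independently of telegraf, something like the following could be used. This is only a minimal sketch: the function name `unresolvable_hosts` and the host list are assumptions for illustration, not taken from the actual telegraf configuration or salt states.

```python
import socket


def unresolvable_hosts(hosts, resolver=socket.gethostbyname):
    """Return the subset of hosts whose names fail to resolve.

    `resolver` defaults to the system resolver; it is a parameter only so
    that the check can be exercised without network access.
    """
    failed = []
    for host in hosts:
        try:
            resolver(host)
        except OSError:
            # socket.gaierror ("no such host") is a subclass of OSError
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Hypothetical list; the real targets live in the telegraf config.
    targets = ["openqaworker-arm-1", "openqaworker-arm-2", "backup-vm"]
    for host in unresolvable_hosts(targets):
        print(f"cannot resolve: {host}")
```

A host that resolves fine here but still shows no ping data would point away from DNS and towards the generated configuration.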

#6 Updated by okurz 3 months ago

  • Priority changed from Urgent to Immediate

By now all three ARM workers are offline and are not being recovered automatically: with no data there is no alert, and with no alert there is no trigger in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines . Bumping prio. I have now manually triggered a power reset of openqaworker-arm-1 and openqaworker-arm-2; openqaworker-arm-3 is not reachable over IPMI. I do not know how to recover openqaworker-arm-3 other than by a manual reset through EngInfra. I thought we had found a way in #89815 to reset the complete chassis, but I can't reach IPMI -> #94513
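
The "no data, no alert" gap could be closed by treating missing data itself as an alert condition. The following is a minimal sketch of that idea, not how our Grafana alerts are actually defined; `stale_hosts`, the 15 minute threshold and the timestamps are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_hosts(last_seen, now, max_age=timedelta(minutes=15)):
    """Return hosts whose newest data point is missing or older than max_age.

    `last_seen` maps a host name to the timestamp of its latest data point,
    or to None if no data point was ever received.
    """
    return sorted(
        host
        for host, timestamp in last_seen.items()
        if timestamp is None or now - timestamp > max_age
    )


if __name__ == "__main__":
    now = datetime(2021, 6, 22, 12, 0, tzinfo=timezone.utc)
    # Hypothetical timestamps, only to show the shape of the input.
    last_seen = {
        "openqaworker-arm-1": datetime(2021, 6, 20, 3, 0, tzinfo=timezone.utc),
        "openqaworker-arm-3": None,
        "openqa": now - timedelta(minutes=1),
    }
    for host in stale_hosts(last_seen, now):
        print(f"no recent monitoring data from {host}")
```

With such a check a host that stops reporting is flagged in the same way as one reporting bad values, so the recovery pipeline would still be triggered.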

#7 Updated by okurz 3 months ago

  • Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added

#8 Updated by okurz 3 months ago

  • Copied to action #94513: openqaworker-arm-3 not reachable and not recoverable over usual ways added

#9 Updated by cdywan 3 months ago

Maybe this is a regression caused by #91779#note-7, as the timing would suggest? If so, a follow-up to or a revert of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507 would be sensible. Maybe the {worker,node}names? change affects jinja templates elsewhere?

#10 Updated by okurz 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

#12 Updated by mkittler 3 months ago

  • Status changed from In Progress to Resolved

The fix has been merged: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/512

I restarted telegraf to be sure that the new config is really applied and the ping data is still updated (after the manual fix). So this should be resolved.

#13 Updated by okurz 3 months ago

  • Status changed from Resolved to Feedback

For an urgent or immediate ticket we should do more than just a single fix. Let's talk about it.

#14 Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved

We have crosschecked and found that only the ARM workers were hit in a problematic way. We started working on the ticket within 12h of setting it to "Immediate" priority. We also extended the wiki. Reminder to everyone: Please take care that urgent or immediate tickets are picked up as fast as possible and get other people to help. Calling it "Resolved" now.
