action #94456
closed
no data from any arm host on https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1
0%
Description
Observation
https://stats.openqa-monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?orgId=1 shows no data on any panel since 2021-06-20, i.e. we have received no monitoring data from the ARM hosts since that date.
Expected result
- Sane data is back
Suggestions
- Check availability of these hosts manually
- Crosscheck if we should have received alerts from other sources, e.g. https://gitlab.suse.de/openqa/grafana-webhook-actions
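The manual availability check suggested above could be scripted roughly as follows. This is only a sketch: the helper names `check_host` and `report` are made up for illustration, and using TCP port 22 (SSH) as the reachability probe is an assumption, not something this ticket prescribes.

```python
import socket


def check_host(host, port=22, timeout=3.0):
    """Classify a host as 'unresolvable', 'unreachable' or 'up'.

    Assumption: an open TCP port (default 22/SSH) is a good enough
    stand-in for "the host is available".
    """
    try:
        addr = socket.gethostbyname(host)
    except OSError:
        # DNS lookup failed, the host name is not known to the resolver
        return "unresolvable"
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return "up"
    except OSError:
        # Name resolves, but nothing answers on the port within the timeout
        return "unreachable"


def report(hosts, port=22):
    """Print one status line per host and return the results as a dict."""
    results = {host: check_host(host, port=port) for host in hosts}
    for host, status in results.items():
        print(f"{host}: {status}")
    return results
```

Calling `report(["openqaworker-arm-1", "openqaworker-arm-2", "openqaworker-arm-3"])` from a machine that can resolve these names would give a quick overview. A host reported as "up" here can still be missing from the dashboards, which would point at the monitoring setup rather than the host.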
Updated by okurz over 3 years ago
I checked and at least openqaworker-arm-1 is up and running openQA jobs just fine. Maybe this is related to #93922#note-10
Updated by okurz over 3 years ago
- Related to action #93922: grafana dashboard for "approximate result size by job group" fails to render any data with "InfluxDB Error: unsupported mean iterator type: *query.stringInterruptIterator" added
Updated by mkittler over 3 years ago
I've also noticed the lack of "ping data" on the new generic dashboards, see #91779#note-8.
Updated by mkittler over 3 years ago
martchus@openqa:~> sudo journalctl -fu telegraf
Jun 22 12:57:00 openqa telegraf[2034]: 2021-06-22T10:57:00Z E! [inputs.ping] Error in plugin: lookup backup-vm on 10.160.0.1:53: no such host
But no errors about other hosts.
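To crosscheck a longer stretch of the journal for such lookup failures, the log lines could be filtered like this. A sketch only: the regular expression is derived from the single error line quoted above, and `unresolvable_ping_targets` is a hypothetical helper name.

```python
import re

# Pattern modelled on the one telegraf error line quoted in this ticket;
# other telegraf versions may word the message differently.
PING_LOOKUP_ERROR = re.compile(
    r"E! \[inputs\.ping\] Error in plugin: "
    r"lookup (?P<host>\S+) on (?P<dns>\S+): no such host"
)


def unresolvable_ping_targets(lines):
    """Map each ping target that failed DNS lookup to the resolver asked."""
    failures = {}
    for line in lines:
        match = PING_LOOKUP_ERROR.search(line)
        if match:
            failures[match.group("host")] = match.group("dns")
    return failures
```

Fed with the journal output of the telegraf unit, an empty result for the ARM workers would support the observation that only backup-vm fails to resolve.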
Updated by mkittler over 3 years ago
The data is also not available for OSD itself: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=76&orgId=1
Updated by okurz over 3 years ago
- Priority changed from Urgent to Immediate
By now all three ARM workers are offline and are not recovered automatically: with no data there is no alert, and with no alert nothing is triggered in https://gitlab.suse.de/openqa/grafana-webhook-actions/-/pipelines , so bumping prio. I have now manually triggered a power reset of openqaworker-arm-1 and openqaworker-arm-2; openqaworker-arm-3 is not reachable over IPMI. I do not know how to recover openqaworker-arm-3 other than by a manual reset by EngInfra. I thought we had found a way to reset the complete chassis in #89815, but I can't reach IPMI -> #94513
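The underlying gap is that missing data was not itself treated as an alert condition. A minimal sketch of such a staleness check, assuming we can obtain the timestamp of the last received data point per host (the `last_seen` mapping, the 15-minute threshold and the function name are all hypothetical):

```python
from datetime import datetime, timedelta, timezone


def stale_hosts(last_seen, now, max_age=timedelta(minutes=15)):
    """Return the sorted names of hosts without a recent data point.

    last_seen maps host name -> datetime of the newest data point,
    or None if no data point was ever received.
    """
    return sorted(
        host
        for host, timestamp in last_seen.items()
        if timestamp is None or now - timestamp > max_age
    )


# Illustrative values only, not real monitoring data
now = datetime(2021, 6, 22, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "openqaworker-arm-1": datetime(2021, 6, 20, 8, 0, tzinfo=timezone.utc),
    "openqaworker-arm-3": None,
    "openqa": now - timedelta(minutes=1),
}
print(stale_hosts(last_seen, now))
```

With a check like this, a host that stops reporting shows up as stale instead of silently dropping out of the alerting.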
Updated by okurz over 3 years ago
- Related to action #89815: osd-deployment blocked by openqaworker-arm-3 offline and not recovered automatically added
Updated by okurz over 3 years ago
- Copied to action #94513: openqaworker-arm-3 not reachable and not recoverable over usual ways added
Updated by livdywan over 3 years ago
Maybe a regression caused by #91779#note-7, as the timing would suggest? So a follow-up to or a revert of https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/507 would be sensible. Maybe the {worker,node}names change affects jinja templates elsewhere?
Updated by okurz over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by okurz over 3 years ago
Added new wiki section with what we learned: https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Setup-guide-for-new-machines
Updated by mkittler over 3 years ago
- Status changed from In Progress to Resolved
The fix has been merged: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/512
I restarted telegraf to make sure that the new config is really applied, and the ping data is still being updated (after the manual fix). So this should be resolved.
Updated by okurz over 3 years ago
- Status changed from Resolved to Feedback
For an urgent or immediate ticket we should do more than just a single fix. Let's talk about it.
Updated by okurz over 3 years ago
- Status changed from Feedback to Resolved
We have crosschecked and found that only the ARM workers were hit in a problematic way. We worked on the ticket within 12h after the priority was set to "Immediate". We also extended the wiki. Reminder to everyone: Please take care that urgent or immediate tickets are picked up as fast as possible and get other people to help. Calling it "Resolved" now.