action #132860
openqa-piworker is unstable and needs regular power-cycles size:M
Description
Observation
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765
only thing found in logs:
salt_ping.log:
Currently the following minions are down:
8d7
< "openqa-piworker.qa.suse.de"
===================
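The `8d7` and `<` lines above have the shape of plain `diff` output: an entry present in the expected minion list is missing from the list of minions that answered. As a minimal sketch of that comparison (not the pipeline's actual script), the helper below prints every expected host that did not respond. The function name `list_down_minions` and the two one-host-per-line input files are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the deployment pipeline.
# Usage: list_down_minions EXPECTED_FILE RESPONDING_FILE
# Both files are assumed to contain one minion name per line.
list_down_minions() {
    # -v: invert the match, -x: match whole lines only,
    # -F: treat patterns as fixed strings, -f: read patterns from a file.
    # grep exits with 1 when it prints nothing (all minions are up),
    # which is not an error here, so only higher exit codes are passed on.
    grep -v -x -F -f "$2" "$1" || [ "$?" -eq 1 ]
}
```

Called as `list_down_minions expected.txt responding.txt`, it prints nothing when every expected minion answered.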
Acceptance criteria
- AC1: we are able to process openQA Raspberry Pi bare-metal jobs consistently over some days
Suggestions
Identify the cause for regression
- likely something related to the hardware RTC
- check whether it simply works with Leap 15.5, since we wanted to upgrade anyway
- could be a recent kernel update so try to downgrade
If it is really necessary and you have exhausted all other remote-controllable options, go to the office, unplug the RTC, and reinstall the system on the assumption that it was borked or corrupted
As Plan Y (if options A to X failed), buy a Wi-Fi and Bluetooth adapter for an IPMI-controllable server and use that instead to connect to the RPi bare-metal test instances
Rollback steps
- Add back salt key with
ssh osd "sudo salt-key -y -a openqa-piworker.qa.suse.de"
Updated by nicksinger 9 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger 9 months ago
- Copied to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added
Updated by nicksinger 9 months ago
- Copied to deleted (action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org)
Updated by nicksinger 9 months ago
- Related to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added
Updated by nicksinger 9 months ago
- Status changed from In Progress to Feedback
System only reacted to ping and "salt.ping". "nmap" showed port 22 as open but ssh did not let me log in and "cmd.run" via salt timed out so I had to hard reset the setup via the PDU. I updated https://racktables.suse.de/index.php?page=object&tab=default&object_id=21043&hl_port_id=156476 with the correct entry and created https://progress.opensuse.org/issues/132902 to check the wrongly documented port of nibali at the next opportunity.
After a hard power-cycle I was able to access the machine again via ssh. No relevant failed openQA jobs to restart. Pipeline restarted as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1696369
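The failure mode described here (ping answers, port 22 is open, but login fails) can be told apart from a fully dead host by probing in layers. The following is a rough sketch under assumptions, not a tool used in this ticket: the function names `classify_host` and `probe_host` are made up, and it presumes that `ping` (iputils), an `nc` variant supporting `-z` and `-w`, `timeout` (coreutils) and key-based `ssh` access are available on the machine running the check.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# classify_host PING_RC PORT_RC SSH_RC
# Turns the exit codes of the three probes into a one-line verdict.
classify_host() {
    if [ "$1" -ne 0 ]; then
        echo "down: no ping reply"
    elif [ "$2" -ne 0 ]; then
        echo "degraded: ping ok, port 22 closed"
    elif [ "$3" -ne 0 ]; then
        echo "hung: port 22 open, ssh login fails"
    else
        echo "healthy"
    fi
}

# probe_host HOST
# Runs the three probes with short timeouts and prints the verdict.
probe_host() {
    host=$1
    ping -c 1 -W 2 "$host" >/dev/null 2>&1; ping_rc=$?
    nc -z -w 2 "$host" 22 >/dev/null 2>&1; port_rc=$?
    # BatchMode avoids hanging on a password prompt; "timeout" bounds a
    # session that connects but never completes the login.
    timeout 15 ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
        >/dev/null 2>&1; ssh_rc=$?
    classify_host "$ping_rc" "$port_rc" "$ssh_rc"
}
```

A call such as `probe_host openqa-piworker.qa.suse.de` would then report the "hung" state for the situation above, which is the case where only a power-cycle via the PDU helped.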
Updated by nicksinger 9 months ago
- Status changed from Feedback to Resolved
Pipeline finished successfully. We might need to reopen this if it happens again, but for now there is unfortunately nothing more I can do.
Updated by nicksinger 9 months ago
- Subject changed from GitlabCI - OSD deployment failed - minion returned non-zero status to openqa-piworker is got unstable and needs regular power-cycles
- Status changed from Resolved to Workable
While working on #132818 I saw that the host is down again. We apparently need to investigate further (e.g. by connecting a serial adapter to that machine).
Updated by nicksinger 9 months ago
- Tags changed from infra, alert, gitlab, deployment to infra, alert, gitlab, deployment, next-frankencampus-visit
- Assignee deleted (nicksinger)
We need to attach a serial connection to the piworker. I have not visited FC today and don't plan to, so it is better to unassign myself.
Updated by openqa_review 9 months ago
- Due date set to 2023-08-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 8 months ago
- Related to action #134735: [alert] openQA piworker openqa-piworker: host up alert added
Updated by nicksinger 2 months ago
- Status changed from Resolved to Blocked
- Priority changed from Urgent to Low
Priority reduced by implementing a lock as described in https://progress.opensuse.org/issues/132860#note-24 . But as long as it is referenced in our systems we should not resolve the ticket as this would indicate the problem is gone by now. Also @okurz has some unanswered questions.
Updated by dheidler 2 months ago
openqa-piworker:~ # zypper ll
# | Name | Type | Repository | Comment
--+---------------------+---------+------------+-----------
1 | kernel-default | package | (any) | poo#132860
2 | kernel-default-base | package | (any) | poo#132860
openqa-piworker:~ # systemctl disable hostapd.service
Removed /etc/systemd/system/multi-user.target.wants/hostapd.service.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper ll
There are no package locks defined.
Disabled hostapd for now (to be able to reset the system remotely if the bug is not fixed).
Let's see if it will survive an update.
Updated by dheidler 2 months ago
The oldest available 15.4 kernel? But isn't that the one that the machine had installed when the problems started?
That would be very surprising considering that the system had 15.5 installed.
That's why I had to go back to the 15.4 kernel from the initial release of 15.4.
But now we are updated to the latest 15.5 kernel.