action #132860
openqa-piworker is unstable and needs regular power-cycles size:M
Description
Observation
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765
only thing found in logs:
salt_ping.log:
Currently the following minions are down:
8d7
< "openqa-piworker.qa.suse.de"
===================
Acceptance criteria
- AC1: we are able to process openQA Raspberry Pi bare-metal jobs consistently over some days
Suggestions
Identify the cause for regression
- likely something related to the hardware RTC
- try if it just works with Leap 15.5 because we wanted to upgrade anyway
- could be a recent kernel update so try to downgrade
If it is really necessary and you have exhausted all other remote-controllable options, go to the office, unplug the RTC, and reinstall the system on the assumption that the installation was broken or corrupted
As Plan Y (if options A to X have failed), buy a Wi-Fi and Bluetooth adapter for an IPMI-controllable server and use that instead to connect to the RPi bare-metal test instances
Rollback steps
- Add back salt key with
ssh osd "sudo salt-key -y -a openqa-piworker.qa.suse.de"
Updated by osukup over 1 year ago
I'm able to ping openqa-piworker, but I cannot ssh into it.
Updated by okurz over 1 year ago
- Tags set to infra, alert, gitlab, deployment
- Priority changed from Normal to Urgent
- Target version set to Ready
Updated by nicksinger over 1 year ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by okurz over 1 year ago
salt \* test.ping
…
schort-server.qa.suse.de:
True
is just fine though.
Updated by nicksinger over 1 year ago
- Copied to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added
Updated by nicksinger over 1 year ago
- Copied to deleted (action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org)
Updated by nicksinger over 1 year ago
- Related to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Feedback
System only reacted to ping and "salt.ping". "nmap" showed port 22 as open but ssh did not let me log in and "cmd.run" via salt timed out so I had to hard reset the setup via the PDU. I updated https://racktables.suse.de/index.php?page=object&tab=default&object_id=21043&hl_port_id=156476 with the correct entry and created https://progress.opensuse.org/issues/132902 to check the wrongly documented port of nibali at the next opportunity.
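The symptom pattern described above (ping answers, port 22 is open, but the ssh login fails) can be turned into a small triage helper. The following is only a sketch: the function name `classify_host` and the state labels are hypothetical, and the commented wiring uses `nc` instead of the `nmap` check mentioned in the comment.

```shell
#!/bin/sh
# Sketch: map three reachability checks to a rough state label.
# Arguments are exit codes (0 = success) of, in order:
#   ping, TCP connect to port 22, non-interactive ssh login.
# The labels are made up for illustration.
classify_host() {
    ping_rc=$1
    port_rc=$2
    ssh_rc=$3
    if [ "$ping_rc" -ne 0 ]; then
        echo "no-ping"
    elif [ "$port_rc" -ne 0 ]; then
        echo "ssh-port-closed"
    elif [ "$ssh_rc" -ne 0 ]; then
        echo "ssh-login-fails"
    else
        echo "ok"
    fi
}

# Possible wiring (assumes iputils ping, a netcat with -z, and OpenSSH):
#   host=openqa-piworker.qa.suse.de
#   ping -c 1 -W 2 "$host" >/dev/null 2>&1; p=$?
#   nc -z -w 5 "$host" 22 >/dev/null 2>&1; t=$?
#   ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true >/dev/null 2>&1; s=$?
#   classify_host "$p" "$t" "$s"
```

A result of `ssh-login-fails` would correspond to the state seen here, where only a hard reset via the PDU helped.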
After a hard power-cycle I was able to access the machine again via ssh. No relevant failed openQA jobs to restart. Pipeline restarted as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1696369
Updated by nicksinger over 1 year ago
- Status changed from Feedback to Resolved
Pipeline finished successfully. We might need to reopen if this happens again, but for now there is unfortunately nothing more I can do.
Updated by nicksinger over 1 year ago
- Subject changed from GitlabCI - OSD deployment failed - minion returned non-zero status to openqa-piworker is got unstable and needs regular power-cycles
- Status changed from Resolved to Workable
While working on #132818 I saw that the host is down again. We apparently need to investigate further (e.g. by connecting a serial adapter to that machine)
Updated by nicksinger over 1 year ago
- Tags changed from infra, alert, gitlab, deployment to infra, alert, gitlab, deployment, next-frankencampus-visit
- Assignee deleted (nicksinger)
We need to attach a serial connection to the piworker. I did not and do not plan to visit FC today, so it might be better to unassign.
Updated by okurz over 1 year ago
- Description updated (diff)
I tried salt \* test.ping on OSD and found all machines fine except openqa-piworker, so I removed that minion from salt. Mentioned in rollback steps.
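To spot the failing minion without reading the whole test.ping output, the result can be filtered. This is a minimal sketch that assumes salt's default text output, where a "name:" line is followed by the result line, as in the excerpt quoted earlier in this ticket. The function name `unresponsive_minions` is hypothetical.

```shell
#!/bin/sh
# Sketch: print every minion whose test.ping result is not "True".
# Reads salt's default text output on stdin: an unindented "name:" line
# followed by a result line.
unresponsive_minions() {
    awk '
        # A line starting in column one and ending with ":" names a minion.
        /^[^[:space:]].*:$/ { name = substr($0, 1, length($0) - 1); next }
        # The next line is that minion result; strip surrounding whitespace.
        name != "" {
            result = $0
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", result)
            if (result != "True") print name
            name = ""
        }
    '
}

# Usage on the salt master (not run here):
#   sudo salt '*' test.ping | unresponsive_minions
```

Any minion listed this way is a candidate for the key removal described above; the rollback step re-adds the key with salt-key.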
Updated by dheidler over 1 year ago
It might be an option to move the piworker to a faster server. It could even be an x86_64 server. It just needs a lot of usb connections to the actual SUT RPis.
Updated by dheidler over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to dheidler
Updated by okurz over 1 year ago
Today you mentioned another good hypothesis: maybe the system is just overloaded, potentially due to the recent change in #129955 to use a more performant but also more demanding video encoder. This should be checked on the system.
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by dheidler over 1 year ago
Was able to power-cycle the whole RPi hardware setup via the remote PDU and log in afterwards.
Updated by dheidler over 1 year ago
Disabled nfs mount, velociraptor, nscd, openqa workers 1-3.
Ideas for proceeding:
- fsck
- badblocks (non destructive mode)
- memtest
- other rpi4 hw
- different sdcard
- update to 15.5
Updated by okurz over 1 year ago
- Subject changed from openqa-piworker is got unstable and needs regular power-cycles to openqa-piworker is unstable and needs regular power-cycles size:M
- Description updated (diff)
Updated by dheidler over 1 year ago
I did some more tests and it seems that I was wrong when I thought that the RTC is the issue, as the problem takes some minutes to appear.
I did a full reinstall without the RTC and still ran into the problem.
It seems to appear when hostapd is started.
Updated by dheidler over 1 year ago
Let's ignore for a minute that I already upgraded the system to 15.5. The oldest 15.4 kernel should be fine:
zypper --releasever=15.4 in --oldpackage kernel-default-5.14.21-150400.22.1
Updated by dheidler over 1 year ago
Added some locks to prevent a kernel update.
zypper ll
# | Name | Type | Repository | Comment
--+---------------------+---------+------------+----------
1 | kernel-default | package | (any) |
2 | kernel-default-base | package | (any) |
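Both locks above have an empty Comment column, which makes it hard to trace later why they exist. The following sketch lists locks without a comment. It assumes the pipe-separated table layout of zypper ll shown above, and the function name `uncommented_locks` is hypothetical.

```shell
#!/bin/sh
# Sketch: print the names of package locks whose Comment column is empty.
# Reads a `zypper ll` table on stdin with the columns
#   # | Name | Type | Repository | Comment
uncommented_locks() {
    awk -F'|' '
        # Only data rows start with a numeric lock index; this skips the
        # header and separator lines.
        NF >= 5 && $1 ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ {
            name = $2
            comment = $5
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", name)
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", comment)
            if (comment == "") print name
        }
    '
}

# Usage on the worker (not run here):
#   zypper ll | uncommented_locks
```

Run against the table above, this would be expected to report both kernel-default and kernel-default-base.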
Updated by dheidler over 1 year ago
- Status changed from In Progress to Resolved
Enabled all services again and put the pi back into the rack.
Kernel is locked to oldest available 15.4 kernel, which seems to work for now.
Bug is reported - let's see if and what kernel people will say.
Updated by okurz over 1 year ago
dheidler wrote:
Kernel is locked to oldest available 15.4 kernel, which seems to work for now.
The oldest available 15.4 kernel? But isn't that the one that the machine had installed when the problems started?
Updated by okurz over 1 year ago
- Related to action #134735: [alert] openQA piworker openqa-piworker: host up alert added
Updated by nicksinger 10 months ago
- Status changed from Resolved to Blocked
- Priority changed from Urgent to Low
Priority reduced by implementing a lock as described in https://progress.opensuse.org/issues/132860#note-24 . But as long as it is referenced in our systems we should not resolve the ticket as this would indicate the problem is gone by now. Also @okurz has some unanswered questions.
Updated by dheidler 10 months ago
openqa-piworker:~ # zypper ll
# | Name | Type | Repository | Comment
--+---------------------+---------+------------+-----------
1 | kernel-default | package | (any) | poo#132860
2 | kernel-default-base | package | (any) | poo#132860
openqa-piworker:~ # systemctl disable hostapd.service
Removed /etc/systemd/system/multi-user.target.wants/hostapd.service.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper ll
There are no package locks defined.
Disabled hostapd for now (to be able to reset the system from remote, if the bug is not fixed).
Let's see if it will survive an update.
Updated by dheidler 10 months ago
okurz wrote:
The oldest available 15.4 kernel? But isn't that the one that the machine had installed when the problems started?
That would be very surprising considering that the system had 15.5 installed.
That's why I had to go back to the 15.4 kernel from the initial release of 15.4.
But now we are updated to the latest 15.5 kernel.
Updated by jbaier_cz 7 months ago
- Related to action #160089: Handle uncommented package lock on "kernel-default" and "kernel-default-base" on openqa-piworker added
Updated by jbaier_cz 7 months ago
I guess it should be safe to apply https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/806 then.