action #132860: openqa-piworker is unstable and needs regular power-cycles size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #132860

closed

openqa-piworker is unstable and needs regular power-cycles size:M

Added by osukup over 1 year ago. Updated 9 months ago.

Status:

Resolved

Priority:

Low

Assignee:

dheidler

Category:

Target version:

openQA Project (public) - Ready

Start date:

2023-07-17

Due date:

2024-02-27

% Done:

Estimated time:

Tags:

alert, deployment, gitlab, infra, next-frankencampus-visit

Description

Observation¶

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765

only thing found in logs:
salt_ping.log:

Currently the following minions are down:
8d7
< "openqa-piworker.qa.suse.de"
===================

Acceptance criteria¶

AC1: we are able to process openQA Raspberry Pi bare-metal jobs consistently over some days

Suggestions¶

Identify the cause of the regression
- likely something related to the hardware RTC
- check whether it just works with Leap 15.5, because we wanted to upgrade anyway
- could be a recent kernel update, so try to downgrade
If it is really necessary and you have exhausted all other remote-controllable options, go to the office, unplug the RTC and reinstall the system, assuming the installation was broken or corrupted
As Plan Y (if options A to X failed), buy a Wi-Fi and Bluetooth adapter for an IPMI-controllable server and use that instead to connect to the Raspberry Pi bare-metal test instances

Rollback steps¶

Add back salt key with ssh osd "sudo salt-key -y -a openqa-piworker.qa.suse.de"

Related issues 3 (0 open — 3 closed)

Updated by osukup over 1 year ago

I'm able to ping openqa-piworker, but not to ssh into it.

Updated by okurz over 1 year ago

Tags set to infra, alert, gitlab, deployment
Priority changed from Normal to Urgent
Target version set to Ready

Updated by nicksinger over 1 year ago

Status changed from New to In Progress
Assignee set to nicksinger

Updated by osukup over 1 year ago

now also schort-server.qa.suse.de

Updated by okurz over 1 year ago

salt \* test.ping
…
schort-server.qa.suse.de:
    True

is just fine though.

Updated by nicksinger over 1 year ago

Copied to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added

Updated by nicksinger over 1 year ago

Copied to deleted (action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org)

Updated by nicksinger over 1 year ago

Related to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added

Updated by nicksinger over 1 year ago

Status changed from In Progress to Feedback

System only reacted to ping and "salt.ping". "nmap" showed port 22 as open but ssh did not let me log in and "cmd.run" via salt timed out so I had to hard reset the setup via the PDU. I updated https://racktables.suse.de/index.php?page=object&tab=default&object_id=21043&hl_port_id=156476 with the correct entry and created https://progress.opensuse.org/issues/132902 to check the wrongly documented port of nibali at the next opportunity.

After a hard power-cycle I was able to access the machine again via ssh. No relevant failed openQA jobs to restart. Pipeline restarted as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1696369
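The state described above (ping answers, port 22 is open, but no ssh login is possible) can be told apart from a fully dead host with a layered check. The following is only a minimal sketch: the function names `classify_host` and `probe_host` and the verdict labels are made up for illustration, and it assumes `ping` from iputils, an `nc` that supports `-z`, and `timeout` from coreutils are available.

```shell
#!/bin/sh
# Hypothetical helper: map three probe exit codes (0 = success) to a verdict.
classify_host() {
    ping_rc=$1 port_rc=$2 ssh_rc=$3
    if [ "$ping_rc" -ne 0 ]; then
        echo "down"            # no ICMP reply at all
    elif [ "$port_rc" -ne 0 ]; then
        echo "network-only"    # ping works, but sshd port is closed
    elif [ "$ssh_rc" -ne 0 ]; then
        echo "hung"            # port 22 open, but login does not complete
    else
        echo "ok"
    fi
}

# Hypothetical helper: run the real probes against one host.
probe_host() {
    host=$1
    ping -c 1 -W 2 "$host" >/dev/null 2>&1; p=$?
    nc -z -w 2 "$host" 22 >/dev/null 2>&1; t=$?
    # BatchMode avoids password prompts; timeout guards against a stalled handshake.
    timeout 15 ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; s=$?
    classify_host "$p" "$t" "$s"
}
```

Running `probe_host openqa-piworker.qa.suse.de` would then print a single verdict; a "hung" result matches the situation where only a hard reset via the PDU helped.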

#10

Updated by nicksinger over 1 year ago

Status changed from Feedback to Resolved

The pipeline finished successfully. We might need to reopen this if it happens again, but for now there is unfortunately nothing more I can do.

#11

Updated by nicksinger over 1 year ago

Subject changed from GitlabCI - OSD deployment failed - minion returned non-zero status to openqa-piworker is got unstable and needs regular power-cycles
Status changed from Resolved to Workable

While working on #132818 I saw that the host is down again. We apparently need to investigate further (e.g. by connecting a serial adapter to that machine)

#12

Updated by nicksinger over 1 year ago

Tags changed from infra, alert, gitlab, deployment to infra, alert, gitlab, deployment, next-frankencampus-visit
Assignee deleted (~~nicksinger~~)

We need to attach a serial connection to the piworker. I did not plan to visit FC today and still don't, so it is better to unassign myself.

#13

Updated by okurz over 1 year ago

Description updated (diff)

I tried salt \* test.ping on OSD and found all machines fine except openqa-piworker so removed that from salt. Mentioned in rollback steps.
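To spot such a minion without reading the whole output, the text printed by `salt \* test.ping` can be filtered. This is a rough sketch only: the function name `list_unresponsive_minions` is hypothetical, and it assumes the default two-line text output (minion name ending in a colon, followed by an indented result line) where anything other than `True` counts as a failure.

```shell
#!/bin/sh
# Hypothetical helper: read `salt \* test.ping` text output on stdin and
# print every minion whose indented result line is not exactly "True".
list_unresponsive_minions() {
    awk '
        # A non-indented line ending in ":" names the minion.
        /^[^[:space:]].*:$/ { minion = substr($0, 1, length($0) - 1); next }
        # The next indented line carries that minion result.
        /^[[:space:]]+/ && minion != "" {
            result = $0
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", result)
            if (result != "True") print minion
            minion = ""
        }
    '
}
```

Used as `salt \* test.ping | list_unresponsive_minions` on OSD, this would list only the hosts that need attention; the exact wording salt prints for a missing minion is not relied upon.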

#14

Updated by dheidler over 1 year ago

It might be an option to move the piworker to a faster server. It could even be an x86_64 server; it just needs a lot of USB connections to the actual SUT RPis.

#15

Updated by dheidler over 1 year ago

Status changed from Workable to In Progress
Assignee set to dheidler

#16

Updated by okurz over 1 year ago

Today you mentioned another good hypothesis: maybe the system is just overloaded, potentially due to the recent change in #129955 to use a more performant but also more demanding video encoder. This should be checked on the system.

#17

Updated by openqa_review over 1 year ago

Due date set to 2023-08-09

Setting due date based on mean cycle time of SUSE QE Tools

#18

Updated by dheidler over 1 year ago

I was able to power-cycle the whole RPi hardware setup via the remote PDU and log in afterwards.

#19

Updated by dheidler over 1 year ago

Disabled nfs mount, velociraptor, nscd, openqa workers 1-3.

Ideas for proceeding:

fsck
badblocks (non destructive mode)
memtest
other rpi4 hw
different sdcard
update to 15.5

#20

Updated by okurz over 1 year ago

Subject changed from openqa-piworker is got unstable and needs regular power-cycles to openqa-piworker is unstable and needs regular power-cycles size:M
Description updated (diff)

#21

Updated by dheidler over 1 year ago

I did some more tests and it seems that I was wrong when I thought that the RTC is the issue, as the problem takes some minutes to appear.

I did a full reinstall without the RTC and still ran into the problem.
It seems to appear when hostapd is started.

#22

Updated by dheidler over 1 year ago

Opened https://bugzilla.suse.com/show_bug.cgi?id=1213757

#23

Updated by dheidler over 1 year ago

Let's ignore for a minute that I already upgraded the system to 15.5. The oldest 15.4 kernel should be fine:

zypper --releasever=15.4 in --oldpackage kernel-default-5.14.21-150400.22.1

#24

Updated by dheidler over 1 year ago

Added some locks to prevent a kernel update.

zypper ll

# | Name                | Type    | Repository | Comment
--+---------------------+---------+------------+----------
1 | kernel-default      | package | (any)      |
2 | kernel-default-base | package | (any)      |
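Locks with an empty Comment column, like the two above, are hard to trace back to a ticket later. As a small sketch, the table printed by `zypper ll` can be scanned for such entries; the function name `list_uncommented_locks` is hypothetical and the parsing assumes the pipe-separated layout shown in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: read `zypper ll` table output on stdin and print the
# name of every lock whose Comment column (5th field) is empty.
list_uncommented_locks() {
    awk -F'|' '
        # Only data rows start with a numeric lock index.
        NF >= 4 && $1 ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ {
            name = $2; comment = $5
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", name)
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", comment)
            if (comment == "") print name
        }
    '
}
```

With `zypper ll | list_uncommented_locks`, any package name printed would be a lock that still needs a reference such as a ticket number.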

#25

Updated by dheidler over 1 year ago

Status changed from In Progress to Resolved

Enabled all services again and put the pi back into the rack.

Kernel is locked to oldest available 15.4 kernel, which seems to work for now.
Bug is reported - let's see if and what kernel people will say.

#26

Updated by okurz over 1 year ago

dheidler wrote:

Kernel is locked to oldest available 15.4 kernel, which seems to work for now.

The oldest available 15.4 kernel? But isn't that the one that the machine had installed when the problems started?

#27

Updated by okurz over 1 year ago

Related to action #134735: [alert] openQA piworker openqa-piworker: host up alert added

#28

Updated by nicksinger 12 months ago

Status changed from Resolved to Blocked
Priority changed from Urgent to Low

Priority reduced by implementing a lock as described in https://progress.opensuse.org/issues/132860#note-24 . But as long as it is referenced in our systems we should not resolve the ticket as this would indicate the problem is gone by now. Also @okurz has some unanswered questions.

#29

Updated by tinita 11 months ago

Due date changed from 2023-08-09 to 2024-02-27
Status changed from Blocked to In Progress

#30

Updated by dheidler 11 months ago

openqa-piworker:~ # zypper ll

# | Name                | Type    | Repository | Comment
--+---------------------+---------+------------+-----------
1 | kernel-default      | package | (any)      | poo#132860
2 | kernel-default-base | package | (any)      | poo#132860

openqa-piworker:~ # systemctl disable hostapd.service
Removed /etc/systemd/system/multi-user.target.wants/hostapd.service.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper ll

There are no package locks defined.

Disabled hostapd for now (to be able to reset the system remotely if the bug is not fixed).
Let's see if it survives an update.

#31

Updated by dheidler 11 months ago

Status changed from In Progress to Resolved

Looks good.
Reenabled hostapd.

#32

Updated by dheidler 11 months ago

The oldest available 15.4 kernel? But isn't that the one that the machine had installed when the problems started?

That would be very surprising considering that the system had 15.5 installed.
That's why I had to go back to the 15.4 kernel from the initial release of 15.4.

But now we are updated to the latest 15.5 kernel.

#33

Updated by jbaier_cz 9 months ago

Related to action #160089: Handle uncommented package lock on "kernel-default" and "kernel-default-base" on openqa-piworker added

#34

Updated by jbaier_cz 9 months ago

I guess it should be safe to apply https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/806 then.
