action #132860

closed

openqa-piworker is unstable and needs regular power-cycles size:M

Added by osukup 9 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2023-07-17
Due date:
2024-02-27
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1694765

The only thing found in the logs (salt_ping.log):

Currently the following minions are down:
8d7
< "openqa-piworker.qa.suse.de"
===================
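
The diff-style excerpt above marks the missing minion with a leading "<". A minimal sketch of pulling host names out of such a log, assuming every down minion appears as a quoted name on its own "<" line (the function name down_minions is made up for illustration and is not part of the pipeline):

```python
import re

# Matches diff lines such as: < "openqa-piworker.qa.suse.de"
DOWN_LINE = re.compile(r'^<\s*"([^"]+)"\s*$')


def down_minions(log_text: str) -> list:
    """Return the minion names listed on '<' lines of a salt_ping.log excerpt."""
    names = []
    for line in log_text.splitlines():
        match = DOWN_LINE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


sample = """Currently the following minions are down:
8d7
< "openqa-piworker.qa.suse.de"
==================="""

print(down_minions(sample))
```

Lines that do not match, such as the "8d7" hunk header or the separator, are simply ignored.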

Acceptance criteria

  • AC1: we are able to process openQA Raspberry Pi bare-metal jobs consistently over some days

Suggestions

  • Identify the cause of the regression

    • likely something related to the hardware RTC
    • check whether it just works with Leap 15.5, because we wanted to upgrade anyway
    • could be a recent kernel update so try to downgrade
  • If it is really necessary and you have exhausted all other remote-controllable options, go to the office, unplug the RTC, and reinstall the system on the assumption that it is borked or corrupted

  • As Plan Y (if options A to X failed), buy a Wi-Fi and Bluetooth adapter for an IPMI-controllable server and use that instead to connect to the RPi bare-metal test instances

Rollback steps

  • Add back salt key with ssh osd "sudo salt-key -y -a openqa-piworker.qa.suse.de"

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure - action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org (Resolved, okurz, 2023-07-17)

Related to openQA Infrastructure - action #134735: [alert] openQA piworker openqa-piworker: host up alert (Resolved, dheidler, 2023-08-28)

Actions #1

Updated by osukup 9 months ago

I'm able to ping openqa-piworker, but not to ssh into it.

Actions #2

Updated by okurz 9 months ago

  • Tags set to infra, alert, gitlab, deployment
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by nicksinger 9 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #4

Updated by osukup 9 months ago

now also schort-server.qa.suse.de

Actions #5

Updated by okurz 9 months ago

salt \* test.ping
…
schort-server.qa.suse.de:
    True

is just fine though.

Actions #6

Updated by nicksinger 9 months ago

  • Copied to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added
Actions #7

Updated by nicksinger 9 months ago

  • Copied to deleted (action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org)
Actions #8

Updated by nicksinger 9 months ago

  • Related to action #132902: Check and document PDU connection of nibali.qe.nue2.suse.org added
Actions #9

Updated by nicksinger 9 months ago

  • Status changed from In Progress to Feedback

The system only reacted to ping and salt "test.ping". "nmap" showed port 22 as open, but ssh did not let me log in and "cmd.run" via salt timed out, so I had to hard-reset the setup via the PDU. I updated https://racktables.suse.de/index.php?page=object&tab=default&object_id=21043&hl_port_id=156476 with the correct entry and created https://progress.opensuse.org/issues/132902 to check the wrongly documented port of nibali at the next opportunity.

After a hard power-cycle I was able to access the machine again via ssh. No relevant failed openQA jobs to restart. Pipeline restarted as https://gitlab.suse.de/openqa/salt-pillars-openqa/-/jobs/1696369
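
The state described above (ping answers, port 22 open, but no usable ssh) can be narrowed down a little from remote. A rough sketch, assuming it is enough to check whether sshd still sends its protocol banner after the TCP connect. The names probe_ssh and classify_banner are hypothetical, and the ticket does not say at which stage the login actually hung:

```python
import socket


def classify_banner(data: bytes) -> str:
    """Return 'banner-ok' if any received line looks like an SSH identification string."""
    # RFC 4253 allows a server to send other lines before the "SSH-" line.
    if any(line.startswith(b"SSH-") for line in data.splitlines()):
        return "banner-ok"
    return "open-no-banner"


def probe_ssh(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify the port as 'unreachable', 'open-no-banner' or 'banner-ok'."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"  # refused, filtered, timed out or unresolvable
    with conn:
        try:
            data = conn.recv(256)  # the socket inherits the connect timeout
        except OSError:
            return "open-no-banner"  # connected, but the daemon stayed silent
    return classify_banner(data)
```

A result of "open-no-banner" would match what "nmap" reported here: the kernel still accepts connections while the daemon no longer responds. A call such as probe_ssh("openqa-piworker.qa.suse.de") needs network access to the host.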

Actions #10

Updated by nicksinger 9 months ago

  • Status changed from Feedback to Resolved

Pipeline finished successfully. We might need to reopen if this happens again, but for now there is unfortunately nothing more I can do.

Actions #11

Updated by nicksinger 9 months ago

  • Subject changed from GitlabCI - OSD deployment failed - minion returned non-zero status to openqa-piworker is got unstable and needs regular power-cycles
  • Status changed from Resolved to Workable

While working on #132818 I saw that the host is down again. We apparently need to investigate further (e.g. by connecting a serial adapter to that machine)

Actions #12

Updated by nicksinger 9 months ago

  • Tags changed from infra, alert, gitlab, deployment to infra, alert, gitlab, deployment, next-frankencampus-visit
  • Assignee deleted (nicksinger)

We need to attach a serial connection to the piworker. I don't plan to visit FC today, so it is better to unassign myself.

Actions #13

Updated by okurz 9 months ago

  • Description updated (diff)

I ran salt \* test.ping on OSD and found all machines fine except openqa-piworker, so I removed its key from salt. The command to add it back is in the rollback steps.

Actions #14

Updated by dheidler 9 months ago

It might be an option to move the piworker to a faster server. It could even be an x86_64 server. It just needs a lot of usb connections to the actual SUT RPis.

Actions #15

Updated by dheidler 9 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #16

Updated by okurz 9 months ago

Today you mentioned another good hypothesis: maybe the system is just overloaded, potentially due to the recent change in #129955 to use a more performant but also more demanding video encoder. This should be checked on the system.
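
One quick way to test the overload hypothesis is to compare the load average with the number of CPU cores. A small sketch, assuming a 1-minute load above 1.5 times the core count counts as overloaded (that factor is an arbitrary choice for illustration, not a value from this ticket):

```python
import os


def is_overloaded(load1: float, cpu_count: int, factor: float = 1.5) -> bool:
    """Return True if the 1-minute load exceeds factor times the core count."""
    return load1 > factor * max(cpu_count, 1)


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages on Unix.
    load1, load5, load15 = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load1={load1:.2f} cores={cores} overloaded={is_overloaded(load1, cores)}")
```

Run repeatedly while the openQA workers and the video encoder are active, this gives a first indication; a tool such as top is still needed to see which process causes the load.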

Actions #17

Updated by openqa_review 9 months ago

  • Due date set to 2023-08-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by dheidler 9 months ago

Was able to power-cycle the whole RPi hardware setup via the remote PDU and log in afterwards.

Actions #19

Updated by dheidler 9 months ago

Disabled nfs mount, velociraptor, nscd, openqa workers 1-3.

Ideas for proceeding:

  • fsck
  • badblocks (non destructive mode)
  • memtest
  • other rpi4 hw
  • different sdcard
  • update to 15.5
Actions #20

Updated by okurz 9 months ago

  • Subject changed from openqa-piworker is got unstable and needs regular power-cycles to openqa-piworker is unstable and needs regular power-cycles size:M
  • Description updated (diff)
Actions #21

Updated by dheidler 9 months ago

I did some more tests and it seems that I was wrong when I thought that the RTC is the issue, as the problem takes some minutes to appear.

I did a full reinstall without the RTC and still ran into the problem.
It seems to appear when hostapd is started.

Actions #23

Updated by dheidler 9 months ago

Let's ignore for a minute that I already upgraded the system to 15.5. The oldest 15.4 kernel should be fine:

zypper --releasever=15.4 in --oldpackage kernel-default-5.14.21-150400.22.1
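
To make "the oldest 15.4 kernel" reproducible, the available versions can be sorted numerically. A sketch, assuming versions look like 5.14.21-150400.22.1 and that the release part starts with 150400 for 15.4. Apart from the version used in the command above, the version strings in the example are invented, and this plain numeric ordering is a simplification of what rpm does:

```python
from typing import List, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '5.14.21-150400.22.1' into a tuple of ints for ordering."""
    return tuple(int(part) for part in version.replace("-", ".").split("."))


def oldest_for_release(versions: List[str], release_tag: str) -> Optional[str]:
    """Return the lowest version whose release part starts with release_tag."""
    candidates = []
    for version in versions:
        _, _, release = version.partition("-")
        if release.startswith(release_tag):
            candidates.append(version)
    return min(candidates, key=version_key) if candidates else None


available = [
    "5.14.21-150400.24.63.1",  # invented example version
    "5.14.21-150400.22.1",     # version from the zypper command above
    "5.14.21-150500.55.12.1",  # invented example version
]
print(oldest_for_release(available, "150400"))
```

The list of available versions would have to come from the repositories; how it is obtained is left out here.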
Actions #24

Updated by dheidler 9 months ago

Added some locks to prevent a kernel update.

zypper ll

# | Name                | Type    | Repository | Comment
--+---------------------+---------+------------+----------
1 | kernel-default      | package | (any)      |
2 | kernel-default-base | package | (any)      |
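
When checking locks from a script it can help to turn the table printed by zypper ll into records. A minimal sketch, assuming the pipe-separated layout shown above; parse_locks is a hypothetical helper, not a zypper feature:

```python
from typing import Dict, List


def parse_locks(table_text: str) -> List[Dict[str, str]]:
    """Parse the pipe-separated lock table into one dict per lock."""
    header = None
    rows = []
    for line in table_text.splitlines():
        if "|" not in line:
            continue  # skips blank lines and the '--+--' separator
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # the first table line names the columns
            continue
        rows.append(dict(zip(header, cells)))
    return rows


sample_table = """
# | Name                | Type    | Repository | Comment
--+---------------------+---------+------------+----------
1 | kernel-default      | package | (any)      |
2 | kernel-default-base | package | (any)      |
"""

for lock in parse_locks(sample_table):
    print(lock["#"], lock["Name"])
```

A monitoring check could then, for example, warn while a lock on kernel-default is still present.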
Actions #25

Updated by dheidler 9 months ago

  • Status changed from In Progress to Resolved

Enabled all services again and put the pi back into the rack.

Kernel is locked to oldest available 15.4 kernel, which seems to work for now.
Bug is reported - let's see if and what kernel people will say.

Actions #26

Updated by okurz 9 months ago

dheidler wrote:

Kernel is locked to oldest available 15.4 kernel, which seems to work for now.

The oldest available 15.4 kernel? But isn't that the one that the machine had installed when the problems started?

Actions #27

Updated by okurz 8 months ago

  • Related to action #134735: [alert] openQA piworker openqa-piworker: host up alert added
Actions #28

Updated by nicksinger 2 months ago

  • Status changed from Resolved to Blocked
  • Priority changed from Urgent to Low

Priority reduced by implementing a lock as described in https://progress.opensuse.org/issues/132860#note-24 . But as long as it is referenced in our systems we should not resolve the ticket as this would indicate the problem is gone by now. Also @okurz has some unanswered questions.

Actions #29

Updated by tinita 2 months ago

  • Due date changed from 2023-08-09 to 2024-02-27
  • Status changed from Blocked to In Progress
Actions #30

Updated by dheidler 2 months ago

openqa-piworker:~ # zypper ll

# | Name                | Type    | Repository | Comment
--+---------------------+---------+------------+-----------
1 | kernel-default      | package | (any)      | poo#132860
2 | kernel-default-base | package | (any)      | poo#132860

openqa-piworker:~ # systemctl disable hostapd.service
Removed /etc/systemd/system/multi-user.target.wants/hostapd.service.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper rl 1
Specified lock has been successfully removed.
1 lock has been successfully removed.
openqa-piworker:~ # zypper ll

There are no package locks defined.

Disabled hostapd for now, so that the system can still be reset remotely if the bug is not fixed.
Let's see if it survives an update.

Actions #31

Updated by dheidler 2 months ago

  • Status changed from In Progress to Resolved

Looks good.
Reenabled hostapd.

Actions #32

Updated by dheidler 2 months ago

okurz wrote:

The oldest available 15.4 kernel? But isn't that the one that the machine had installed when the problems started?

That would be very surprising considering that the system had 15.5 installed.
That's why I had to go back to the 15.4 kernel from the initial release of 15.4.

But now we are updated to the latest 15.5 kernel.
