action #116782

o3 s390 workers are offline

Added by kraih 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2022-09-19
Due date:
2022-10-04
% Done:

0%

Estimated time:

Description

Observation

The O3 workers openqaworker1_container:101/102/103/104 went offline 2 days ago (graceful disconnect). Since then, the journal has looked like this:

Sep 19 14:36:27 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Scheduled restart job, restart counter is at 40055.
Sep 19 14:36:27 openqaworker1 systemd[1]: Stopped Podman container-openqaworker1_container_101.service.
Sep 19 14:36:27 openqaworker1 systemd[1]: Starting Podman container-openqaworker1_container_101.service...
Sep 19 14:36:29 openqaworker1 podman[3032]: time="2022-09-19T14:36:29+02:00" level=warning msg="Path \"/etc/SUSEConnect\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
Sep 19 14:36:29 openqaworker1 podman[3032]: time="2022-09-19T14:36:29+02:00" level=warning msg="Path \"/etc/zypp/credentials.d/SCCcredentials\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
Sep 19 14:36:30 openqaworker1 podman[3032]: 2022-09-19 14:36:30.048687627 +0200 CEST m=+2.267947307 container init 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opencontainers.image>
Sep 19 14:36:30 openqaworker1 podman[3032]: 2022-09-19 14:36:30.298700925 +0200 CEST m=+2.517960607 container start 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opensuse.base.versi>
Sep 19 14:36:30 openqaworker1 podman[3032]: openqaworker1_container_101
Sep 19 14:36:30 openqaworker1 podman[3665]: 2022-09-19 14:36:30.324363704 +0200 CEST m=+0.053321696 container died 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101)
Sep 19 14:36:30 openqaworker1 systemd[1]: Started Podman container-openqaworker1_container_101.service.
Sep 19 14:36:31 openqaworker1 podman[3665]: 2022-09-19 14:36:31.278853821 +0200 CEST m=+1.007811860 container cleanup 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opensuse.base.cre>
Sep 19 14:36:31 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 14:36:31 openqaworker1 podman[3997]: 2022-09-19 14:36:31.638557925 +0200 CEST m=+0.261362636 container cleanup 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opensuse.base.ver>
Sep 19 14:36:31 openqaworker1 podman[3997]: openqaworker1_container_101
Sep 19 14:36:31 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Failed with result 'exit-code'.
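The restart counter of 40055 in the excerpt above suggests the unit is stuck in a restart loop. A minimal sketch of how such loops could be spotted from journal text follows. The function names and the threshold of 10 are hypothetical choices for illustration and are not part of openQA or systemd; the sample lines are copied from the excerpt above.

```python
import re

# Matches systemd's "Scheduled restart job" message and captures unit and counter.
RESTART_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+\.service): "
    r"Scheduled restart job, restart counter is at (?P<count>\d+)\."
)


def restart_counters(journal_lines):
    """Return {unit: highest restart counter seen} for the given journal lines."""
    counters = {}
    for line in journal_lines:
        match = RESTART_RE.search(line)
        if match:
            unit = match.group("unit")
            count = int(match.group("count"))
            counters[unit] = max(count, counters.get(unit, 0))
    return counters


def crash_looping(counters, threshold=10):
    """Return units whose counter exceeds the threshold (hypothetical default)."""
    return sorted(unit for unit, count in counters.items() if count > threshold)


# Two lines taken verbatim from the journal excerpt in this ticket.
SAMPLE = [
    "Sep 19 14:36:27 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Scheduled restart job, restart counter is at 40055.",
    "Sep 19 14:36:31 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Main process exited, code=exited, status=1/FAILURE",
]

if __name__ == "__main__":
    print(crash_looping(restart_counters(SAMPLE)))
```

The parsing is kept separate from any journal access, so the same functions could be fed lines from a saved log file or from a journalctl pipe.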

Acceptance criteria

  • AC1: The workers are online again

Suggestions


Related issues

Related to openQA Infrastructure - action #97751: replacement setup for o3 s390x openQA workers size:M (Resolved, 2021-09-17)

Related to openQA Project - action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M (Resolved, 2022-11-01)

History

#1 Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by okurz 3 months ago

I found that systemctl --failed lists var-lib-openqa-share.mount, which may be related. openqaworker1_container:101 shows that the latest openQA jobs ran 3 days ago, on 2022-09-16. registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest is certainly outdated and we should try to run a 15.4 or Tumbleweed version. Please update our upgrade instructions for workers to include that step for next time.
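The systemctl --failed check mentioned above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the example output line only illustrates the usual column layout (unit, load, active, sub, description) rather than being captured from openqaworker1.

```python
import subprocess


def parse_failed_units(output):
    """Extract unit names from `systemctl --failed --plain --no-legend` output."""
    units = []
    for line in output.splitlines():
        fields = line.split()
        if fields:
            # The first column is the unit name.
            units.append(fields[0])
    return units


def failed_units():
    """Run systemctl and return the failed unit names (needs a systemd host)."""
    result = subprocess.run(
        ["systemctl", "--failed", "--plain", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_failed_units(result.stdout)


# Illustrative output only; not copied from the affected machine.
EXAMPLE_OUTPUT = (
    "var-lib-openqa-share.mount loaded failed failed /var/lib/openqa/share\n"
)

if __name__ == "__main__":
    print(parse_failed_units(EXAMPLE_OUTPUT))
```

Only failed_units() needs a systemd host; the parser can be tried on saved output anywhere.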

Instructions for the setup can be found on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers

The original setup ticket was #97751

#3 Updated by okurz 3 months ago

  • Related to action #97751: replacement setup for o3 s390x openQA workers size:M added

#4 Updated by okurz 3 months ago

  • Description updated (diff)
  • Priority changed from High to Urgent

#5 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#6 Updated by mkittler 3 months ago

  • Status changed from New to In Progress

I repeated the steps from https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers. I tried registry.opensuse.org/devel/openqa/containers15.4/openqa_worker:latest instead of registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, but it looks like the containers are still on 15.2. For today I'll keep it that way to process pending s390x jobs (which look good so far).
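One way to see which image a container was actually started from is the image= field that podman writes into its journal events, as in the excerpt in the description. The sketch below extracts that field and the release number embedded in the image path. The function names are hypothetical, and it is an assumption that the path component (containers15.2 vs. containers15.4) reflects the distribution inside the image.

```python
import re

# image=<reference> as it appears in podman's container event lines.
IMAGE_FIELD_RE = re.compile(r"image=(?P<image>[^,)\s]+)")
# Release number embedded in the repository path, e.g. ".../containers15.2/...".
RELEASE_RE = re.compile(r"/containers(?P<release>\d+\.\d+)/")


def image_from_event(line):
    """Return the image reference from a podman journal event line, or None."""
    match = IMAGE_FIELD_RE.search(line)
    return match.group("image") if match else None


def release_from_image(image_ref):
    """Return the release encoded in the image path, or None if absent."""
    match = RELEASE_RE.search(image_ref)
    return match.group("release") if match else None


# Event line copied verbatim from the journal excerpt in this ticket.
EVENT = (
    "Sep 19 14:36:30 openqaworker1 podman[3665]: 2022-09-19 14:36:30.324363704 "
    "+0200 CEST m=+0.053321696 container died "
    "955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 "
    "(image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, "
    "name=openqaworker1_container_101)"
)

if __name__ == "__main__":
    image = image_from_event(EVENT)
    print(image, release_from_image(image))
```

Comparing this value with the intended release after recreating the containers would show whether the running container still comes from the old image.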

#7 Updated by openqa_review 3 months ago

  • Due date set to 2022-10-04

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback

Recovered the setup and updated it to use 15.4 (see the updated instructions in the Wiki).

Not all jobs are passing, but these failures were there before and are likely caused by something on the remote worker. It generally works on the updated container, e.g. https://openqa.opensuse.org/tests/2704235.

#9 Updated by mkittler 2 months ago

  • Status changed from Feedback to Resolved

I guess this can be considered resolved. Some s390x jobs pass and some fail (as before), but they are being executed again.

#10 Updated by okurz about 1 month ago

  • Related to action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M added
