action #116782
closed
o3 s390 workers are offline
Description
Observation
The o3 workers openqaworker1_container:101/102/103/104 went offline 2 days ago (graceful disconnect). Since then the journal has looked like this:
Sep 19 14:36:27 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Scheduled restart job, restart counter is at 40055.
Sep 19 14:36:27 openqaworker1 systemd[1]: Stopped Podman container-openqaworker1_container_101.service.
Sep 19 14:36:27 openqaworker1 systemd[1]: Starting Podman container-openqaworker1_container_101.service...
Sep 19 14:36:29 openqaworker1 podman[3032]: time="2022-09-19T14:36:29+02:00" level=warning msg="Path \"/etc/SUSEConnect\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
Sep 19 14:36:29 openqaworker1 podman[3032]: time="2022-09-19T14:36:29+02:00" level=warning msg="Path \"/etc/zypp/credentials.d/SCCcredentials\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping"
Sep 19 14:36:30 openqaworker1 podman[3032]: 2022-09-19 14:36:30.048687627 +0200 CEST m=+2.267947307 container init 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opencontainers.image>
Sep 19 14:36:30 openqaworker1 podman[3032]: 2022-09-19 14:36:30.298700925 +0200 CEST m=+2.517960607 container start 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opensuse.base.versi>
Sep 19 14:36:30 openqaworker1 podman[3032]: openqaworker1_container_101
Sep 19 14:36:30 openqaworker1 podman[3665]: 2022-09-19 14:36:30.324363704 +0200 CEST m=+0.053321696 container died 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101)
Sep 19 14:36:30 openqaworker1 systemd[1]: Started Podman container-openqaworker1_container_101.service.
Sep 19 14:36:31 openqaworker1 podman[3665]: 2022-09-19 14:36:31.278853821 +0200 CEST m=+1.007811860 container cleanup 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opensuse.base.cre>
Sep 19 14:36:31 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 14:36:31 openqaworker1 podman[3997]: 2022-09-19 14:36:31.638557925 +0200 CEST m=+0.261362636 container cleanup 955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812 (image=registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest, name=openqaworker1_container_101, org.opensuse.base.ver>
Sep 19 14:36:31 openqaworker1 podman[3997]: openqaworker1_container_101
Sep 19 14:36:31 openqaworker1 systemd[1]: container-openqaworker1_container_101.service: Failed with result 'exit-code'.
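The excerpt above shows the unit cycling through start, exit with status=1/FAILURE, and scheduled restart, with the restart counter already at 40055. A minimal sketch of how such journal lines could be summarized per unit follows; the function name `summarize_journal` and the two regular expressions are illustrative assumptions that only match the message wording seen in this excerpt, not part of any openQA tooling.

```python
import re

# Patterns match the systemd message wording seen in the excerpt above (assumption:
# other systemd versions may phrase these messages differently).
RESTART_RE = re.compile(
    r"(?P<unit>\S+\.service): Scheduled restart job, restart counter is at (?P<count>\d+)\."
)
EXIT_RE = re.compile(
    r"(?P<unit>\S+\.service): Main process exited, code=\w+, status=(?P<status>\S+)"
)


def summarize_journal(lines):
    """Return {unit: {"restarts": int or None, "last_exit": str or None}}."""
    summary = {}
    for line in lines:
        restart = RESTART_RE.search(line)
        if restart:
            entry = summary.setdefault(
                restart["unit"], {"restarts": None, "last_exit": None}
            )
            entry["restarts"] = int(restart["count"])
            continue
        exited = EXIT_RE.search(line)
        if exited:
            entry = summary.setdefault(
                exited["unit"], {"restarts": None, "last_exit": None}
            )
            entry["last_exit"] = exited["status"]
    return summary


if __name__ == "__main__":
    import sys

    # Example: journalctl -u container-openqaworker1_container_101.service | python3 this_script.py
    for unit, info in summarize_journal(sys.stdin).items():
        print(unit, info)
```

A very high restart counter combined with a repeated non-zero exit status, as seen here, points at a container that fails immediately after start rather than at a one-off crash.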
Acceptance criteria
- AC1: The workers are online again
Suggestions
- Speak with dheidler, who set these workers up
- Read history from setup ticket #97751 and read instructions from https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers
- Run a 15.4 or Tumbleweed version
- Update our upgrade instructions for workers accordingly to include that step for the next time
Updated by okurz about 2 years ago
- Priority changed from Normal to High
- Target version set to Ready
Updated by okurz about 2 years ago
I found that systemctl --failed lists var-lib-openqa-share.mount, which may be related. openqaworker1_container:101 shows that the latest openQA jobs ran 3 days ago, on 2022-09-16. registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest is certainly outdated and we should try to run a 15.4 or Tumbleweed version. Please update our upgrade instructions for workers accordingly to include that step for the next time.
Instructions for the setup can be found on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers
The original setup ticket was #97751
Updated by okurz about 2 years ago
- Related to action #97751: replacement setup for o3 s390x openQA workers size:M added
Updated by okurz about 2 years ago
- Description updated (diff)
- Priority changed from High to Urgent
Updated by mkittler about 2 years ago
- Status changed from New to In Progress
Re-conducted the steps on https://progress.opensuse.org/projects/openqav3/wiki/Wiki#o3-s390-workers. I tried registry.opensuse.org/devel/openqa/containers15.4/openqa_worker:latest instead of registry.opensuse.org/devel/openqa/containers15.2/openqa_worker:latest but it looks like the containers are still on 15.2. For today I'll keep it that way to process pending s390x jobs (which so far look good).
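One way to confirm which Leap version a container is actually based on is to look at the image reference it was created from. The following is a rough sketch, not the procedure used in this ticket: `leap_version` and `running_image` are hypothetical helper names, the version is read purely from the `containersNN.N` path segment used by the two image references quoted above, and the `.ImageName` field of `podman inspect` is assumed to hold the image reference.

```python
import re
import subprocess


def leap_version(image_ref):
    """Extract the Leap version from a reference like .../containers15.2/openqa_worker:latest.

    Returns None if the reference has no containersNN.N path segment
    (for example, a differently named Tumbleweed project).
    """
    match = re.search(r"/containers(\d+\.\d+)/", image_ref)
    return match.group(1) if match else None


def running_image(container_name):
    """Ask podman which image a container was created from.

    Assumption: `podman inspect --format '{{.ImageName}}'` prints the image reference.
    """
    result = subprocess.run(
        ["podman", "inspect", "--format", "{{.ImageName}}", container_name],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Container names as they appear in the journal excerpt of this ticket.
    for instance in (101, 102, 103, 104):
        name = f"openqaworker1_container_{instance}"
        image = running_image(name)
        print(name, image, leap_version(image))
```

If the reported reference still contains `containers15.2`, the container was most likely not recreated from the new image, since an existing container keeps the image it was created with.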
Updated by openqa_review about 2 years ago
- Due date set to 2022-10-04
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
Recovered the setup and updated it to use 15.4 (see the updated instructions in the wiki).
Not all jobs are passing but these failures were there before and are likely caused by something on the remote work. It generally works on the updated container, e.g. https://openqa.opensuse.org/tests/2704235.
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
I guess that can be considered resolved. Some s390x jobs are passing, some are failing (like before) but they are executed again.
Updated by okurz about 2 years ago
- Related to action #119713: Leap tests are failing because of failed log file uploading in multiple tests on s390x size:M added