action #117229
closed[tools] openqa failing on worker QA-Power8-5-kvm
0%
Description
https://openqa.suse.de/tests/9563826
https://openqa.suse.de/tests/9563818
https://openqa.suse.de/tests/9564636
By looking in e.g. https://openqa.suse.de/admin/workers/1062, it appears that there are consistent failures since a couple of days.
Updated by okurz about 2 years ago
- Subject changed from openqa failing on worker QA-Power8-5-kvm:7 to [tools] openqa failing on worker QA-Power8-5-kvm:7
- Category set to Infrastructure
- Priority changed from Normal to High
- Target version set to Ready
Updated by jbaier_cz about 2 years ago
- Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added
Updated by mkittler about 2 years ago
- Subject changed from [tools] openqa failing on worker QA-Power8-5-kvm:7 to [tools] openqa failing on worker QA-Power8-5-kvm
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to mkittler
Looks like those jobs get stuck after starting QEMU (until MAX_JOB_TIME is exceeded and the worker terminates it). Considering it is still reacting to SIGTERM, it doesn't look like it is totally stuck, though.
The other worker slots on QA-Power8-5-kvm are affected as well; their job history looks very much the same. I'm updating the ticket title and description accordingly.
I'll check whether there's something suspicious in the worker's journal. Then I'll restart the machine. Maybe that helps.
Updated by mkittler about 2 years ago
Sep 19 21:21:34 QA-Power8-5-kvm worker[50951]: [debug] [pid:50951] Stopping job 9541373 from openqa.suse.de: 09541373-sle-15-SP5-Online-ppc64le-yast2_apparmor:investigate:last_good_build:19.1@ppc64le-2g - reason: timeout
Sep 19 21:21:34 QA-Power8-5-kvm worker[50951]: [debug] [pid:50951] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541373/status
Sep 19 21:21:34 QA-Power8-5-kvm worker[50951]: [debug] [pid:50951] Announcing job termination (due to timeout) to command server via http://localhost:20083/0Rh6KVs3M5nKbCaa/broadcast
Sep 19 21:21:34 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 21:21:36 QA-Power8-5-kvm worker[50951]: [info] [pid:50951] +++ worker notes +++
Sep 19 21:21:36 QA-Power8-5-kvm worker[50951]: [info] [pid:50951] End time: 2022-09-19 19:21:35
Sep 19 22:10:25 QA-Power8-5-kvm worker[52167]: [debug] [pid:52167] Stopping job 9541456 from openqa.suse.de: 09541456-sle-15-SP3-BCI-Updates-ppc64le-Build30.61_minimal-15sp3_image_on_SLES_15-SP1@ppc64le - reason: timeout
Sep 19 22:10:25 QA-Power8-5-kvm worker[52167]: [debug] [pid:52167] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541456/status
Sep 19 22:10:25 QA-Power8-5-kvm worker[52167]: [debug] [pid:52167] Announcing job termination (due to timeout) to command server via http://localhost:20063/uMPDE2CqbPblromg/broadcast
Sep 19 22:10:25 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:10:26 QA-Power8-5-kvm worker[82515]: [debug] [pid:82515] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9542532/status
Sep 19 22:10:26 QA-Power8-5-kvm worker[82515]: [debug] [pid:82515] Upload concluded (no current module)
Sep 19 22:11:43 QA-Power8-5-kvm worker[52245]: [debug] [pid:52245] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541493/status
Sep 19 22:11:43 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:11:43 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:11:43 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:11:43 QA-Power8-5-kvm worker[52225]: [debug] [pid:52225] Stopping job 9541492 from openqa.suse.de: 09541492-sle-15-SP4-BCI-Updates-ppc64le-Build22.21_init-image-15sp4_image_on_SLES_15@ppc64le - reason: timeout
Sep 19 22:11:43 QA-Power8-5-kvm worker[52225]: [debug] [pid:52225] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541492/status
Sep 20 07:29:52 QA-Power8-5-kvm worker[30712]: [debug] [pid:30712] Uploading artefact vars.json
Sep 20 07:29:52 QA-Power8-5-kvm worker[30712]: [error] [pid:30712] REST-API error (POST http://openqa.suse.de/api/v1/jobs/9542570/status): Connection error: Premature connection close (remaining tries: 53)
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm worker[30575]: [info] [pid:30575] Uploading video_time.vtt
Sep 20 14:13:33 QA-Power8-5-kvm worker[30796]: [debug] [pid:30796] Stopping job 9546236 from openqa.suse.de: 09546236-sle-15-SP3-BCI-Updates-ppc64le-Build18.117_python-3.9-image-lang_image_on_SLES_15-SP4@ppc64le - reason: timeout
Sep 20 14:13:33 QA-Power8-5-kvm worker[30796]: [debug] [pid:30796] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9546236/status
Sep 20 14:13:33 QA-Power8-5-kvm worker[30796]: [debug] [pid:30796] Announcing job termination (due to timeout) to command server via http://localhost:20083/bAH3nzOFiwDkjxFc/broadcast
Sep 20 14:13:33 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 14:13:34 QA-Power8-5-kvm worker[30717]: [debug] [pid:30717] Stopping job 9546235 from openqa.suse.de: 09546235-sle-15-SP3-BCI-Updates-ppc64le-Build18.117_python-3.9-image-lang_image_on_SLES_15-SP3@ppc64le - reason: timeout
Sep 20 14:13:34 QA-Power8-5-kvm worker[30717]: [debug] [pid:30717] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9546235/status
Sep 20 14:13:34 QA-Power8-5-kvm worker[30717]: [debug] [pid:30717] Announcing job termination (due to timeout) to command server via http://localhost:20073/GT6jzjuGyBZBAIYz/broadcast
Sep 20 14:13:34 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 14:13:34 QA-Power8-5-kvm worker[30796]: [info] [pid:30796] +++ worker notes +++
Sep 20 14:13:34 QA-Power8-5-kvm worker[30796]: [info] [pid:30796] End time: 2022-09-20 12:13:34
Sep 20 14:13:34 QA-Power8-5-kvm worker[30796]: [info] [pid:30796] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [info] [pid:63122] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [info] [pid:63122] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [info] [pid:63122] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [info] [pid:63273] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [info] [pid:63273] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [info] [pid:63273] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [info] [pid:63133] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [info] [pid:63133] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [info] [pid:63133] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [info] [pid:63149] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [info] [pid:63149] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [info] [pid:63149] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [warn] [pid:63122] Websocket connection to http://openqa.suse.de/api/v1/ws/1062 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [warn] [pid:63133] Websocket connection to http://openqa.suse.de/api/v1/ws/1058 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [info] [pid:63148] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [info] [pid:63148] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [info] [pid:63148] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [info] [pid:63108] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [info] [pid:63108] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [info] [pid:63108] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [warn] [pid:63149] Websocket connection to http://openqa.suse.de/api/v1/ws/1265 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [warn] [pid:63273] Websocket connection to http://openqa.suse.de/api/v1/ws/1048 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[1767]: [info] [pid:1767] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1767]: [debug] [pid:1767] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1766]: [info] [pid:1766] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [warn] [pid:63148] Websocket connection to http://openqa.suse.de/api/v1/ws/1043 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[1766]: [debug] [pid:1766] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [warn] [pid:63108] Websocket connection to http://openqa.suse.de/api/v1/ws/1042 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[1770]: [info] [pid:1770] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1769]: [info] [pid:1769] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1770]: [debug] [pid:1770] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1769]: [debug] [pid:1769] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm worker[1771]: [info] [pid:1771] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1771]: [debug] [pid:1771] Uploading artefact vars.json
So there's one unusual reoccurring error message (cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
) happening in the context of these timeouts.
Explanation of CMA: https://developer.toradex.com/linux-bsp/how-to/linux-features/contiguous-memory-allocator-cma-linux/, https://lwn.net/Articles/486301/
Note that no CMA issues show up on QA-Power8-4-kvm which is currently still on Leap 15.3 (5.3.18-150300.59.93-default
).
Updated by openqa_review about 2 years ago
- Due date set to 2022-10-13
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 2 years ago
After rebooting the machine ended up on 5.14.21-150400.24.21-default again and the cma_alloc
errors are gone and the job history looks better.
However, the machine isn't generally stable on that version. So for #114565 I now booted on 5.19.12-lp153.2.g95fa5b8-default, see https://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c21. Let's see how well that version works.
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
Updated by mkittler about 2 years ago
I'm handling this as part of #114565, see most recent comments there.
Updated by mkittler about 2 years ago
- Project changed from openQA Tests to openQA Infrastructure
- Category deleted (
Infrastructure)
Updated by mkittler about 2 years ago
Automatic recovery would make sense, see #114565#note-64.
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
I'm closing this ticket in favor of #114565 because having to update always two tickets is annoying. The concrete issue of jobs timing out has been resolved by not using Linux 5.8 anymore anyways.