Project

General

Profile

Actions

action #117229

closed

[tools] openqa failing on worker QA-Power8-5-kvm

Added by emiura over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2022-09-26
Due date:
2022-10-13
% Done:

0%

Estimated time:

Description


Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #114565: recover qa-power8-4+qa-power8-5 size:MResolvedokurz2022-12-19

Actions
Actions #1

Updated by okurz over 1 year ago

  • Subject changed from openqa failing on worker QA-Power8-5-kvm:7 to [tools] openqa failing on worker QA-Power8-5-kvm:7
  • Category set to Infrastructure
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by jbaier_cz over 1 year ago

  • Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added
Actions #3

Updated by mkittler over 1 year ago

  • Subject changed from [tools] openqa failing on worker QA-Power8-5-kvm:7 to [tools] openqa failing on worker QA-Power8-5-kvm
  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to mkittler

Looks like those jobs get stuck after starting QEMU (until MAX_JOB_TIME is exceeded and the worker terminates it). Considering it is still reacting to SIGTERM, it doesn't look like it is totally stuck, though.

The other worker slots on QA-Power8-5-kvm are affected as well; their job history looks very much the same. I'm updating the ticket title and description accordingly.

I'll check whether there's something suspicious in the worker's journal. Then I'll restart the machine. Maybe that helps.

Actions #4

Updated by mkittler over 1 year ago

Sep 19 21:21:34 QA-Power8-5-kvm worker[50951]: [debug] [pid:50951] Stopping job 9541373 from openqa.suse.de: 09541373-sle-15-SP5-Online-ppc64le-yast2_apparmor:investigate:last_good_build:19.1@ppc64le-2g - reason: timeout
Sep 19 21:21:34 QA-Power8-5-kvm worker[50951]: [debug] [pid:50951] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541373/status
Sep 19 21:21:34 QA-Power8-5-kvm worker[50951]: [debug] [pid:50951] Announcing job termination (due to timeout) to command server via http://localhost:20083/0Rh6KVs3M5nKbCaa/broadcast
Sep 19 21:21:34 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 21:21:36 QA-Power8-5-kvm worker[50951]: [info] [pid:50951] +++ worker notes +++
Sep 19 21:21:36 QA-Power8-5-kvm worker[50951]: [info] [pid:50951] End time: 2022-09-19 19:21:35
Sep 19 22:10:25 QA-Power8-5-kvm worker[52167]: [debug] [pid:52167] Stopping job 9541456 from openqa.suse.de: 09541456-sle-15-SP3-BCI-Updates-ppc64le-Build30.61_minimal-15sp3_image_on_SLES_15-SP1@ppc64le - reason: timeout
Sep 19 22:10:25 QA-Power8-5-kvm worker[52167]: [debug] [pid:52167] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541456/status
Sep 19 22:10:25 QA-Power8-5-kvm worker[52167]: [debug] [pid:52167] Announcing job termination (due to timeout) to command server via http://localhost:20063/uMPDE2CqbPblromg/broadcast
Sep 19 22:10:25 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:10:26 QA-Power8-5-kvm worker[82515]: [debug] [pid:82515] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9542532/status
Sep 19 22:10:26 QA-Power8-5-kvm worker[82515]: [debug] [pid:82515] Upload concluded (no current module)
Sep 19 22:11:43 QA-Power8-5-kvm worker[52245]: [debug] [pid:52245] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541493/status
Sep 19 22:11:43 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:11:43 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:11:43 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 19 22:11:43 QA-Power8-5-kvm worker[52225]: [debug] [pid:52225] Stopping job 9541492 from openqa.suse.de: 09541492-sle-15-SP4-BCI-Updates-ppc64le-Build22.21_init-image-15sp4_image_on_SLES_15@ppc64le - reason: timeout
Sep 19 22:11:43 QA-Power8-5-kvm worker[52225]: [debug] [pid:52225] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9541492/status
Sep 20 07:29:52 QA-Power8-5-kvm worker[30712]: [debug] [pid:30712] Uploading artefact vars.json
Sep 20 07:29:52 QA-Power8-5-kvm worker[30712]: [error] [pid:30712] REST-API error (POST http://openqa.suse.de/api/v1/jobs/9542570/status): Connection error: Premature connection close (remaining tries: 53)
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 07:29:52 QA-Power8-5-kvm worker[30575]: [info] [pid:30575] Uploading video_time.vtt
Sep 20 14:13:33 QA-Power8-5-kvm worker[30796]: [debug] [pid:30796] Stopping job 9546236 from openqa.suse.de: 09546236-sle-15-SP3-BCI-Updates-ppc64le-Build18.117_python-3.9-image-lang_image_on_SLES_15-SP4@ppc64le - reason: timeout
Sep 20 14:13:33 QA-Power8-5-kvm worker[30796]: [debug] [pid:30796] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9546236/status
Sep 20 14:13:33 QA-Power8-5-kvm worker[30796]: [debug] [pid:30796] Announcing job termination (due to timeout) to command server via http://localhost:20083/bAH3nzOFiwDkjxFc/broadcast
Sep 20 14:13:33 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 14:13:34 QA-Power8-5-kvm worker[30717]: [debug] [pid:30717] Stopping job 9546235 from openqa.suse.de: 09546235-sle-15-SP3-BCI-Updates-ppc64le-Build18.117_python-3.9-image-lang_image_on_SLES_15-SP3@ppc64le - reason: timeout
Sep 20 14:13:34 QA-Power8-5-kvm worker[30717]: [debug] [pid:30717] REST-API call: POST http://openqa.suse.de/api/v1/jobs/9546235/status
Sep 20 14:13:34 QA-Power8-5-kvm worker[30717]: [debug] [pid:30717] Announcing job termination (due to timeout) to command server via http://localhost:20073/GT6jzjuGyBZBAIYz/broadcast
Sep 20 14:13:34 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 20 14:13:34 QA-Power8-5-kvm worker[30796]: [info] [pid:30796] +++ worker notes +++
Sep 20 14:13:34 QA-Power8-5-kvm worker[30796]: [info] [pid:30796] End time: 2022-09-20 12:13:34
Sep 20 14:13:34 QA-Power8-5-kvm worker[30796]: [info] [pid:30796] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [info] [pid:63122] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [info] [pid:63122] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [info] [pid:63122] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [info] [pid:63273] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [info] [pid:63273] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [info] [pid:63273] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [info] [pid:63133] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [info] [pid:63133] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [info] [pid:63133] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [info] [pid:63149] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [info] [pid:63149] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [info] [pid:63149] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63122]: [warn] [pid:63122] Websocket connection to http://openqa.suse.de/api/v1/ws/1062 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[63133]: [warn] [pid:63133] Websocket connection to http://openqa.suse.de/api/v1/ws/1058 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [info] [pid:63148] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [info] [pid:63148] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [info] [pid:63148] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [info] [pid:63108] +++ worker notes +++
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [info] [pid:63108] End time: 2022-09-28 08:48:19
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [info] [pid:63108] Result: timeout
Sep 28 10:48:19 QA-Power8-5-kvm worker[63149]: [warn] [pid:63149] Websocket connection to http://openqa.suse.de/api/v1/ws/1265 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[63273]: [warn] [pid:63273] Websocket connection to http://openqa.suse.de/api/v1/ws/1048 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[1767]: [info] [pid:1767] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1767]: [debug] [pid:1767] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1766]: [info] [pid:1766] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[63148]: [warn] [pid:63148] Websocket connection to http://openqa.suse.de/api/v1/ws/1043 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[1766]: [debug] [pid:1766] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[63108]: [warn] [pid:63108] Websocket connection to http://openqa.suse.de/api/v1/ws/1042 finished by remote side with code 1006, no reason - trying again in 10 seconds
Sep 28 10:48:19 QA-Power8-5-kvm worker[1770]: [info] [pid:1770] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1769]: [info] [pid:1769] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1770]: [debug] [pid:1770] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1769]: [debug] [pid:1769] Uploading artefact vars.json
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -4
Sep 28 10:48:19 QA-Power8-5-kvm worker[1771]: [info] [pid:1771] Uploading vars.json
Sep 28 10:48:19 QA-Power8-5-kvm worker[1771]: [debug] [pid:1771] Uploading artefact vars.json

So there's one unusual reoccurring error message (cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -4) happening in the context of these timeouts.

Explanation of CMA: https://developer.toradex.com/linux-bsp/how-to/linux-features/contiguous-memory-allocator-cma-linux/, https://lwn.net/Articles/486301/


Note that no CMA issues show up on QA-Power8-4-kvm which is currently still on Leap 15.3 (5.3.18-150300.59.93-default).

Actions #5

Updated by openqa_review over 1 year ago

  • Due date set to 2022-10-13

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by mkittler over 1 year ago

After rebooting the machine ended up on 5.14.21-150400.24.21-default again and the cma_alloc errors are gone and the job history looks better.

However, the machine isn't generally stable on that version. So for #114565 I now booted on 5.19.12-lp153.2.g95fa5b8-default, see https://bugzilla.opensuse.org/show_bug.cgi?id=1202138#c21. Let's see how well that version works.

Actions #7

Updated by mkittler over 1 year ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by mkittler over 1 year ago

I'm handling this as part of #114565, see most recent comments there.

Actions #9

Updated by mkittler over 1 year ago

  • Project changed from openQA Tests to openQA Infrastructure
  • Category deleted (Infrastructure)
Actions #10

Updated by mkittler over 1 year ago

Automatic recovery would make sense, see #114565#note-64.

Actions #11

Updated by mkittler over 1 year ago

  • Status changed from Feedback to Resolved

I'm closing this ticket in favor of #114565 because having to update always two tickets is annoying. The concrete issue of jobs timing out has been resolved by not using Linux 5.8 anymore anyways.

Actions

Also available in: Atom PDF