action #75454 (closed)

sometimes clone job is incomplete because of api failure

Added by zluo almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: Support
Target version: Ready
Start date: 2020-10-28
Due date:
% Done: 0%
Estimated time:

Description

Please see the following:
http://10.162.23.47/tests/8409

Result: incomplete, finished 7 minutes ago (00:50 minutes)
Reason: api failure
Clone of 8408; cloned as 8410
Scheduled product: job has not been created by posting an ISO (but possibly the original job)
Assigned worker: couperin:1

This happens sometimes, and the job just states "api failure" without any logs or further information.

Re-trying the clone job works in most cases.
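
For reference, a minimal sketch of how such a retry can be triggered from the command line, assuming the openqa-clone-job script is used (the exact way the clone was triggered is not recorded in this ticket):

    # re-post the job referenced above; by default openqa-clone-job clones onto
    # the local openQA instance, so the target host may need to be adjusted
    openqa-clone-job http://10.162.23.47/tests/8408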


Files

openqa-worker@1.log.gz (11.1 MB), added by zluo, 2020-10-29 07:45

Related issues: 1 (1 open, 0 closed)

Copied to openQA Project - action #76765: job is incomplete with reason just being "api failure" and no logs can be uploaded due to OOM condition on worker, improve reason to point to potential causes (Workable)

Actions #1

Updated by okurz almost 4 years ago

  • Subject changed from "[tools] sometimes clone job is incomplete because of api failure" to "sometimes clone job is incomplete because of api failure"
  • Category set to Support
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Target version set to Ready

As the job has no logs uploaded, can you please provide the logs from the systemd worker service? E.g. log into couperin over ssh and call journalctl -u openqa-worker@1.
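
Something along these lines on the worker host should be enough; the timestamps are just placeholders for the window around the time the job incompleted:

    # copy only the relevant part of the worker journal instead of the full log
    journalctl -u openqa-worker@1 --since "2020-10-28 10:00" --until "2020-10-28 11:00" > openqa-worker@1-excerpt.log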

Actions #2

Updated by zluo almost 4 years ago

Attached the logs of openqa-worker, thanks.

Actions #3

Updated by okurz almost 4 years ago

  • Copied to action #76765: job is incomplete with reason just being "api failure" and no logs can be uploaded due to OOM condition on worker, improve reason to point to potential causes added

Actions #4

Updated by okurz almost 4 years ago

  • Status changed from Feedback to Resolved

OK, this is the complete log since 2020-04 (!). It would have been sufficient to just copy the relevant section of the worker journal from around the time the job incompleted. There I see:

Oct 28 10:19:49 couperin worker[4904]: [debug] [pid:4904] REST-API call: POST http://10.162.23.47/api/v1/jobs/8408/status
Oct 28 10:19:49 couperin worker[4904]: [error] [pid:4904] Upload images subprocess error: Can't fork: Cannot allocate memory
Oct 28 10:19:59 couperin worker[4904]: [info] [pid:4904] Test schedule has changed, reloading test_order.json
Oct 28 10:19:59 couperin worker[4904]: [debug] [pid:4904] REST-API call: POST http://10.162.23.47/api/v1/jobs/8408/status
Oct 28 10:19:59 couperin worker[4904]: [error] [pid:4904] Upload images subprocess error: Can't fork: Cannot allocate memory
Oct 28 10:20:09 couperin worker[4904]: [debug] [pid:4904] REST-API call: POST http://10.162.23.47/api/v1/jobs/8408/status
Oct 28 10:20:11 couperin worker[4904]: [error] [pid:4904] Upload images subprocess error: Can't fork: Cannot allocate memory
Oct 28 10:20:21 couperin worker[4904]: [debug] [pid:4904] REST-API call: POST http://10.162.23.47/api/v1/jobs/8408/status
Oct 28 10:20:22 couperin worker[4904]: [error] [pid:4904] Upload images subprocess error: Can't fork: Cannot allocate memory
Oct 28 10:20:32 couperin worker[4904]: [debug] [pid:4904] REST-API call: POST http://10.162.23.47/api/v1/jobs/8408/status

So you exhausted all available memory on your machine. This is something that we cannot prevent. I suggest you either try to carefully avoid such a situation or set up a monitoring alert that informs you when your worker host is suffering from situations like these. What we can try to improve is to extend the "api failure" reason to point to potential problems that could lead to this symptom. Created #76765 for this minor improvement.
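
Just as an illustration, and not something openQA itself ships: a trivial periodic check along these lines, run from a cron job or systemd timer on the worker host, could serve as a starting point for such an alert (the threshold and the reporting via systemd-cat are arbitrary choices):

    #!/bin/sh
    # warn when available memory on the worker host drops below a threshold
    THRESHOLD_KB=524288  # ~512 MiB
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    if [ "$avail_kb" -lt "$THRESHOLD_KB" ]; then
        echo "low memory on $(hostname): ${avail_kb} kB available" | systemd-cat -t memcheck -p warning
    fi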
