Project

General

Profile

Actions

action #95299

closed

Tests timeout with reason 'setup exceeded MAX_SETUP_TIME' on osd ppc64le workers auto_review:"Result: timeout":retry size:M

Added by dzedro almost 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-07-09
Due date:
% Done:

0%

Estimated time:

Description

Observation

I don't know why this ppc64le jobs timed out, cache service ?
Should the regex be more specific or can it be generic for any timeout ?

https://openqa.suse.de/tests/6393939
https://openqa.suse.de/tests/6393941
https://openqa.suse.de/tests/6393945

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#95299

ssh osd "sudo -u geekotest psql --command=\"select jobs.id,result_dir,t_finished,host,instance from jobs join workers on jobs.assigned_worker_id=workers.id where reason ~ 'timeout: setup exceeded' order by t_finished;\" openqa"

Expected result

A log https://openqa.suse.de/tests/6416847/logfile?filename=autoinst-log.txt shows how it should look:

[2021-07-12T10:36:56.0008 CEST] [info] [pid:64283] Download of SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso processed:
[info] [#106833]
Cache size of "/var/lib/openqa/cache" is 44GiB, with limit 50GiB
[info] [#106833]
Downloading "SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/6416847/asset/iso/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso"
[info] [#106833]
Content of "/var/lib/openqa/cache/openqa.suse.de/SLE-15-SP1-Installer-DVD-ppc64le-GM-DVD1.iso" has not changed, updating last use

[2021-07-12T10:36:56.0100 CEST] [info] [pid:64283] Rsync from 'rsync://openqa.suse.de/tests' to '/var/lib/openqa/cache/openqa.suse.de', request #106839 sent to Cache Service
[2021-07-12T10:37:01.0168 CEST] [info] [pid:64283] Output of rsync:
[info] [#106839] Calling: rsync -avHP rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
receiving incremental file list

sent 1,992 bytes  received 2,581,652 bytes  1,722,429.33 bytes/sec
total size is 12,978,495,757  speedup is 5,023.33

[2021-07-12T10:37:01.0168 CEST] [info] [pid:64283] Finished to rsync tests
[2021-07-12T10:37:01.0172 CEST] [debug] [pid:70271] +++ worker notes +++

where one can see the output from the rsync call.

Problem

Problem seems to be specific to ppc64le workers, maybe only "malbec" now. Maybe specific to test syncing. The cacheservice-minion logs do not mention the test rsync request at all.

Suggestions

  • Start with calling git grep on the source code of openQA for the log messages that we see in the jobs mentioned above (or see comments below), to see which debug messages should be expected before the timeout
  • If you identify that there could be helpful log messages, e.g. to be able to distinguish if a request was received by the minion service or not, add it

Related issues 3 (1 open2 closed)

Related to openQA Project - coordination #96185: [epic] Multimachine failure rate increasedResolvedokurz2021-07-29

Actions
Copied to openQA Project - action #96254: Tests timeout with MAX_SETUP_TIME - Add an alert if there is any non-restarted job exceeding max_setup_timeNew

Actions
Copied to openQA Infrastructure - action #96257: Tests timeout with MAX_SETUP_TIME on osd - Apply a higher MAX_SETUP_TIME applicable for the *complete* OSD infrastructure, e.g. add to every worker config entryResolvedmkittler

Actions
Actions

Also available in: Atom PDF