action #58700: It seems we have "Test schedule has changed, reloading test_order.json" in *every* job now, I doubt the schedule actually changes - openQA Project (public) - openSUSE Project Management Tool

Actions

Copy link

action #58700

closed

It seems we have "Test schedule has changed, reloading test_order.json" in every job now, I doubt the schedule actually changes

Added by okurz over 5 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

tinita

Category:

Regressions/Crashes

Target version:

Current Sprint

Start date:

2019-10-25

Due date:

% Done:

Estimated time:

Description

Observation¶

See the worker log from every job. The message "Test schedule has changed, reloading test_order.json" is repeated very often when we would expect to

Acceptance criteria¶

AC1: The test order is only re-evaluated if necessary
AC2: Any logging output corresponds to actual actions, i.e. message "Test schedule has changed, reloading test_order.json" when test schedule has actually changed, e.g. worker log from https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Incidents-Kernel&machine=64bit&test=ltp_pty&version=15 only shows a single (or limited) "Test schedule has changed…"
AC3: the test order is not re-evaluated for static schedules, e.g. worker log from https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=textmode&version=Tumbleweed does not show "Test schedule has changed, reloading test_order.json"

Suggestions¶

Research what was the original idea of #53948 and https://github.com/os-autoinst/openQA/pull/2302 together with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329 and the approaches of https://github.com/os-autoinst/openQA/pull/2439 and https://github.com/os-autoinst/openQA/pull/2445 trying to fix the problem
Find a good test case based on the openQA scenarios mentioned above
Ensure there is sufficient unit test coverage within os-autoinst for the feature
Fix the real problem :)

Actions

Copy link

Updated by tinita over 5 years ago

Status changed from New to In Progress
Assignee set to tinita

I see the message appearing once in every job.
I think the log message should just be written when the size or mtime really changed.
That's an easy fix.

Actions

Copy link

Updated by tinita over 5 years ago

% Done changed from 0 to 100

PR: https://github.com/os-autoinst/openQA/pull/2439

Actions

Copy link

Updated by tinita over 5 years ago

Status changed from In Progress to Feedback

Actions

Copy link

Updated by tinita over 5 years ago

PR was merged

Actions

Copy link

Updated by okurz over 5 years ago

Please see https://github.com/os-autoinst/openQA/pull/2439#discussion_r339448791

Actions

Copy link

Updated by tinita over 5 years ago

New PR: https://github.com/os-autoinst/openQA/pull/2445

Actions

Copy link

Updated by MDoucha over 5 years ago

FYI, this is not a bug. The log message was printed in every job on purpose to debug misconfigured workers. Specifically the exact kind of misconfiguration which has been blocking https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329 for the past month.

Edit: And yes, the test schedule does change when the log line gets printed the first time. From empty to something.

Actions

Copy link

Updated by livdywan over 5 years ago

MDoucha wrote:

FYI, this is not a bug. The log message was printed in every job on purpose to debug misconfigured workers. Specifically the exact kind of misconfiguration which has been blocking https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329 for the past month.

Edit: And yes, the test schedule does change when the log line gets printed the first time. From empty to something.

Would you mind clarifying the use case here? How is the message being used for debugging?

The message, the way I read it, suggests the schedule changed, which implies it's not the usual schedule but it was explicitly updated. That's also what was discussed when the code was being worked on. If that were shown in every test regardless it would seem to defeat the point?

Actions

Copy link

Updated by tinita over 5 years ago

Status changed from Feedback to Workable
% Done changed from 100 to 0

Actions

Copy link

#10

Updated by tinita over 5 years ago

Assignee deleted (~~tinita~~)

Actions

Copy link

#11

Updated by mkittler almost 5 years ago

Status changed from Workable to New

Considering the discussion is isn't clear what's left to be done.

Actions

Copy link

#12

Updated by okurz almost 5 years ago

Description updated (diff)
Status changed from New to Workable

I hope I could provide sufficient details that you also consider it "Workable" :)

Actions

Copy link

#13

Updated by Xiaojing_liu almost 5 years ago

May I have a try? I am interested in this feature, although I am not familiar with this.

Actions

Copy link

#14

Updated by Xiaojing_liu almost 5 years ago

Status changed from Workable to In Progress
Assignee set to Xiaojing_liu
Target version set to Current Sprint

Actions

Copy link

#15

Updated by Xiaojing_liu almost 5 years ago

I did some tests about this issue, here is test report:
For now we check that if the test_order has been changed by checking if the current_test_module exists or the file test_order.json modify time/size is different from before value. There are some reasons that maybe we should not use current_test_module

the current test name is null between a test module finish and another test module start. In some case, the process can last more than 1 min. Here is an example: The message in os-autoinst.log

[0m[37m[2020-05-28T10:37:47.903 CST] [debug] ||| finished zypper_up update at 2020-05-28 02:37:47 (42 s)
[0m[37m[2020-05-28T10:37:47.904 CST] [debug] Creating a VM snapshot lastgood
[0m[37m[2020-05-28T10:37:47.905 CST] [debug] Saving snapshot (Current VM state is running).
...
[0m[37m[2020-05-28T10:38:25.585 CST] [debug] Snapshot complete
[0m[1;34m[2020-05-28T10:38:25.640 CST] [debug] ||| starting system_prepare tests/console/system_prepare.pm

Before ||| finished, the function done is called and it will set the current_test null.
here is the content of autoinst-status.json

[2020-05-28T10:38:24.0133 CST] [error] the test_status is {
  'status' => 'running',
  'test_execution_paused' => 0,
  'current_test' => undef
}

So from 10:37:47.904 to 10:38:25.640, the current_test is null, the log Test schedule has changed will be printed many times.

When all test modules have been finished, the current_test will be set null, but there is still other mission, e.g. creating qcow file. During this process, the current_test does not exists, and the log Test schedule has changed... will be printed many times. Here is an example:

[0m[37m[2020-05-27T17:50:26.901 CST] [debug] ||| finished shutdown_ltp kernel at 2020-05-27 09:50:26 (26 s)
[0m[37m[2020-05-27T17:50:26.903 CST] [debug] stopping autotest process 24786
[0m[37m[2020-05-27T17:50:26.909 CST] [debug] [autotest] process exited: 0
[0m[37m[2020-05-27T17:50:26.909 CST] [debug] done with autotest process
[0m[37m[2020-05-27T17:50:26.910 CST] [debug] terminating command server 24783 because test execution ended
[0m[37m[2020-05-27T17:50:26.910 CST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20013/4_8v1amG6lqwzlWH/broadcast
[0m[37m[2020-05-27T17:50:26.924 CST] [debug] commands process exited: 0
[0m[37m[2020-05-27T17:50:26.924 CST] [debug] done with command server
[0m[37m[2020-05-27T17:50:26.924 CST] [debug] isotovideo done
[0m[37m[2020-05-27T17:50:26.925 CST] [debug] BACKEND SHUTDOWN 1
[0m[33m[2020-05-27T17:50:26.926 CST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[0m[37m[2020-05-27T17:50:28.964 CST] [debug] flushing frames
[0m[37m[2020-05-27T17:50:28.991 CST] [debug] QEMU: qemu-system-x86_64: terminating on signal 15 from pid 24790 (/usr/bin/isotovideo: backen)
[0m[37m[2020-05-27T17:50:28.993 CST] [debug] sending magic and exit
[0m[37m[2020-05-27T17:50:28.993 CST] [debug] received magic close
[0m[37m[2020-05-27T17:50:29.026 CST] [debug] backend process exited: 0
[0m[33m[2020-05-27T17:50:29.027 CST] [info] ::: backend::qemu::do_extract_assets: Extracting (?^:^hd0$)
[0m[37m[2020-05-27T17:50:29.027 CST] [debug] running nice ionice qemu-img convert -c -O qcow2 /var/lib/openqa/pool/1/raid/hd0-overlay1 assets_public/sle-12-SP5-x86_64-4.12.14-212.1.gd51ffe9-Server-DVD-Incidents-Kernel@64bit-with-ltp.qcow2
[0m[37m[2020-05-27T17:53:47.656 CST] [debug] running qemu-img info --output=json assets_public/sle-12-SP5-x86_64-4.12.14-212.1.gd51ffe9-Server-DVD-Incidents-Kernel@64bit-with-ltp.qcow2
[0m[37m[2020-05-27T17:53:47.676 CST] [debug] stopping backend process 24790
[0m[37m[2020-05-27T17:53:47.676 CST] [debug] done with backend process

And the log in worker-log

[2020-05-27T17:52:21.0689 CST] [error] the test_status is {
  'test_execution_paused' => 0,
  'status' => 'finished',
  'current_test' => ''
}

In above examples, the test schedule has not changed. Because of the current_test_module does not exist, there are many logs. And there is one problem, in these situations, the openQA will update the job modules again and again, because openQA will update job modules when test_order is not empty.

Actions

Copy link

#16