action #37000
closed
[opensuse][functional][u][sporadic] test fails in reboot_plasma5 - either stuck in shutdown or not enough waiting time for grub2?
Description
Observation
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-update_Leap_42.1_kde@64bit fails in reboot_plasma5
Reproducible
Fails since (at least) Build 20180606 (current job)
Expected result
Last good: 20180605 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by okurz over 6 years ago
- Due date set to 2018-07-17
- Target version set to Milestone 17
Updated by okurz over 6 years ago
- Target version changed from Milestone 17 to Milestone 17
Updated by okurz over 6 years ago
- Target version changed from Milestone 17 to Milestone 18
Updated by okurz over 6 years ago
- Due date changed from 2018-07-17 to 2018-07-31
It's hackweek time!
Updated by zluo over 6 years ago
- Status changed from New to In Progress
- Assignee set to zluo
take over
Updated by zluo over 6 years ago
https://openqa.opensuse.org/tests/700235#step/reboot_plasma5/5
This run from 3 days ago still shows the issue. The latest test run looks good.
Updated by okurz over 6 years ago
- Subject changed from [opensuse][functional][u] test fails in reboot_plasma5 - extend wait time to load grub2 to [opensuse][functional][u][sporadic] test fails in reboot_plasma5 - extend wait time to load grub2
From the job history I can see that the test module failure is sporadic, i.e. we need better statistics than just single jobs -> https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation
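To illustrate why single jobs are not enough for a sporadic failure, here is a minimal sketch that puts an uncertainty range on a failure rate observed over several runs, using the Wilson score interval. This is not necessarily the method described in the linked wiki; the function name and the run counts are assumptions made up for illustration.

```python
from math import sqrt


def wilson_interval(failures, total, z=1.96):
    """Return (low, high) bounds for the true failure rate.

    failures/total are observed counts (hypothetical inputs here);
    z=1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failures <= total:
        raise ValueError("failures must be between 0 and total")
    p = failures / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Made-up example: 0 failures observed in 20 runs.
    low, high = wilson_interval(0, 20)
    print(f"failure rate between {low:.3f} and {high:.3f}")
```

With 0 failures in 20 runs the upper bound is still about 0.16, so a handful of green jobs says little about a sporadic issue.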
Updated by okurz over 6 years ago
- Status changed from In Progress to Resolved
Updated by mloviska over 6 years ago
Issue has occurred again:
- opensuse-Tumbleweed-NET-x86_64-Build20180719-update_13.2@64bit
- opensuse-Tumbleweed-NET-x86_64-Build20180719-zdup-Leap-42.1-kde@64bit
Should we reopen this ticket?
Updated by mloviska over 6 years ago
- Status changed from Resolved to Workable
Failing jobs:
- opensuse-Tumbleweed-NET-x86_64-Build20180719-update_Leap_42.2_kde@64bit
- opensuse-Tumbleweed-NET-x86_64-Build20180719-update_Leap_42.3_kde+system_performance@64bit
- opensuse-Tumbleweed-NET-x86_64-Build20180719-zdup-Leap-42.3-kde@64bit
- opensuse-Tumbleweed-NET-x86_64-Build20180719-update_Leap_42.1_kde@64bit
Updated by okurz over 6 years ago
I see. Thanks for the observation. I guess we need to handle the longish shutdown better -> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5314 from @oorlov
Updated by okurz over 6 years ago
- Related to action #37003: [opensuse][functional][u][sporadic] test fails in network_configuration - xterm does not start added
Updated by okurz over 6 years ago
- Subject changed from [opensuse][functional][u][sporadic] test fails in reboot_plasma5 - extend wait time to load grub2 to [opensuse][functional][u][sporadic] test fails in reboot_plasma5 - either stuck in shutdown or not enough waiting time for grub2?
- Due date changed from 2018-07-31 to 2018-08-14
- Status changed from Workable to Blocked
Updated by okurz over 6 years ago
- Status changed from Blocked to Feedback
The latest example, https://openqa.opensuse.org/tests/714457#step/reboot_plasma5/3 , again clearly looks like we simply do not wait long enough.
-> https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/5476
Updated by okurz over 6 years ago
PR merged.
The latest job ran before the PR was merged, so I guess we need to collect better statistics … by waiting
Updated by oorlov over 6 years ago
The PR adding the ability to gather more logs on shutdown is finally merged.
So I've added the DEBUG_SHUTDOWN=1 setting to the 'kde' scenarios.
Let's see after several executions whether it gives us some meaningful logs.
Updated by okurz over 6 years ago
- Assignee changed from okurz to oorlov
Please closely monitor the scenario then to make sure this does not introduce even more failures.
Updated by okurz over 6 years ago
I recommend making use of the weekend capacity and triggering some more jobs on o3 for statistical investigation to cross-check.
Updated by oorlov over 6 years ago
I've increased the timeouts with TIMEOUT_SCALE=3 for the test suite and the test passed.
https://openqa.opensuse.org/tests/724980
https://openqa.opensuse.org/tests/724981
The long wait (~90 sec) occurs between these two steps:
[ 3159.891194] display-manager[5540]: Shutting down service sddm..done
[ 3248.440114] systemd[1807]: dbus.service: State 'stop-final-sigterm' timed out. Killing.
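The size of the gap can be checked directly from the bracketed timestamps. The following is a small sketch; the helper names are made up for illustration, and it assumes every line starts with a "[ seconds.fraction]" stamp as in the two lines above.

```python
import re

# Matches a leading "[ 3159.891194]" style timestamp followed by the message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_timestamp(line):
    """Return the timestamp in seconds from a single log line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    return float(match.group(1))


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the biggest pause."""
    stamped = [(parse_timestamp(line), line) for line in lines]
    best = None
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, before, after)
    if best is None:
        raise ValueError("need at least two log lines")
    return best


if __name__ == "__main__":
    excerpt = [
        "[ 3159.891194] display-manager[5540]: Shutting down service sddm..done",
        "[ 3248.440114] systemd[1807]: dbus.service: State 'stop-final-sigterm' timed out. Killing.",
    ]
    gap, before, after = largest_gap(excerpt)
    print(f"{gap:.1f} s between:\n  {before}\n  {after}")
```

For the two lines quoted here the difference is about 88.5 seconds, which matches the "~90 sec" estimate.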
Updated by okurz over 6 years ago
- Due date changed from 2018-08-14 to 2018-08-28
Bulk move to next sprint as it could not be discussed in SR
Updated by SLindoMansilla over 6 years ago
okurz, why is this ticket in Feedback? I cannot see any open PR, nor are we waiting for any verification run. Can I change the status to "In Progress"?
Updated by okurz over 6 years ago
oorlov wrote:
I've increased the timeouts with TIMEOUT_SCALE=3 for the test suite and the test passed.
https://openqa.opensuse.org/tests/724980
https://openqa.opensuse.org/tests/724981
The long wait (~90 sec) occurs between these two steps:
[ 3159.891194] display-manager[5540]: Shutting down service sddm..done
[ 3248.440114] systemd[1807]: dbus.service: State 'stop-final-sigterm' timed out. Killing.
I am not sure if we actually need any change in os-autoinst although your PR there looks fine.
For issues like the one above I recommend debugging further what causes the long wait in between and reporting a bug.
Can you take a look at https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/force_scheduled_tasks.pm#L29 and apply the same method here? It should be OK to just call assert_shutdown with a large timeout and record a soft failure depending on the actual elapsed time.
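The pattern suggested here, waiting with a generous hard timeout and downgrading to a soft failure based on the measured time, can be sketched independently of the test framework. The sketch below is plain Python, not os-autoinst code: run_with_soft_limit, its parameters and the "softfail" result are hypothetical names, and the 600 s / 60 s values are made up. The real test module would call assert_shutdown and use the framework's own soft-failure mechanism instead.

```python
import time


def run_with_soft_limit(action, hard_timeout, soft_limit, clock=time.monotonic):
    """Run a blocking action and classify the result by elapsed time.

    action(hard_timeout) is expected to block until the awaited event
    happens and to raise on its own if hard_timeout is exceeded
    (hypothetical stand-in for a shutdown assertion).
    Returns ("ok", elapsed) or ("softfail", elapsed).
    """
    start = clock()
    action(hard_timeout)  # hard failure propagates as an exception
    elapsed = clock() - start
    status = "softfail" if elapsed > soft_limit else "ok"
    return status, elapsed


if __name__ == "__main__":
    def fake_shutdown(timeout):
        # Stand-in for the real wait; finishes almost immediately.
        time.sleep(0.01)

    status, elapsed = run_with_soft_limit(fake_shutdown, hard_timeout=600, soft_limit=60)
    print(status, f"{elapsed:.2f}s")
```

The design choice is that a slow shutdown no longer fails the job outright but is still recorded, so the underlying product issue stays visible and can be reported as a bug.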
Updated by mgriessmeier over 6 years ago
- Due date changed from 2018-08-28 to 2018-09-11
Updated by mgriessmeier over 6 years ago
- Due date changed from 2018-09-11 to 2018-09-25
let's discuss the state offline
Updated by oorlov over 6 years ago
- Status changed from Feedback to Resolved
After updating the shutdown module and scaling the timeouts with TIMEOUT_SCALE=3, neither the reboot_plasma5 nor the shutdown module has failed in more than the last 20 builds.
I've checked the job in the last 6 builds; shutdown never took more than 15 seconds to finish.
I'm closing the ticket as 'Resolved', as the issue no longer reproduces.
Updated by okurz over 6 years ago
Good observation. Do we actually set TIMEOUT_SCALE anywhere, or was this just used for investigation?
Updated by okurz almost 6 years ago
- Copied to action #47246: [opensuse][functional][y] Get rid of TIMEOUT_SCALE in kde testsuite on o3 if still there or adjust test suite added
Updated by okurz almost 5 years ago
As I realized myself, I can answer my own question in #37000#note-28: we do set "TIMEOUT_SCALE=3" in tests, e.g. as visible in https://openqa.opensuse.org/tests/1171538# which the test suite description also reflects. TIMEOUT_SCALE is meant as a temporary measure, or for really slow workers, which we should not have in production at all. Where necessary we should bump internal timeouts, then remove TIMEOUT_SCALE and adjust the test suite settings accordingly. I recorded this in #63388 rather than reopening this ticket, which is already a bit old.