action #158104
openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances
openQA Project (public) - coordination #158110: [epic] Prevent worker overload
typing issue on ppc64 worker size:S
Description
Observation
openQA test in scenario sle-15-SP6-Online-ppc64le-ha_beta_supportserver@ppc64le-2g fails in setup
https://openqa.suse.de/tests/13885455#step/setup/84 (see attachment p1.png)
https://openqa.suse.de/tests/13885471#step/setup/30 (see attachment p2.png) The "$" before "?" was not typed.
https://openqa.suse.de/tests/13885404#step/setup/12 (see attachment p3.png)
https://openqa.suse.de/tests/13885407#step/setup/9 (see attachment p4.png)
I think this may be related to the high workload of the underlying ppc64 worker.
All of these jobs ran on "mania".
Test suite description
The base test suite is used for job templates defined in YAML documents. It has no settings of its own.
Reproducible
Fails since (at least) Build 73.1 (current job)
Expected result
Last good: 67.1 (or more recent)
Suggestions
- Identify the affected machines and workers, apply mitigations to prevent recurring typing issues, e.g. reducing CPU load
- Restart related failed jobs
- Identify follow-up tasks
- Reduce the number of worker instances as a first mitigation measure. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759 (merged)
- Make the alert for CPU load more strict - #158113
- Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - #158116
- Check existing ffmpeg processes on mania which take a lot of CPU time - #158116
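For the last suggestion, a minimal sketch of how ffmpeg processes could be ranked by accumulated CPU time. It assumes a procps-style `ps` whose `time` column uses the `[DD-]HH:MM:SS` format; the function names are hypothetical and not part of openQA.

```python
import subprocess


def cputime_to_seconds(value: str) -> int:
    """Convert a ps cumulative CPU time such as '1-02:03:04' to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    seconds = 0
    for part in value.split(":"):
        seconds = seconds * 60 + int(part)
    return days * 86400 + seconds


def top_cpu_consumers(ps_output: str, command: str = "ffmpeg"):
    """Return (pid, cpu_seconds) for processes named `command`, busiest first."""
    result = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3 or fields[2].strip() != command:
            continue
        result.append((int(fields[0]), cputime_to_seconds(fields[1])))
    return sorted(result, key=lambda item: item[1], reverse=True)


def read_ps() -> str:
    """Run ps without headers, printing pid, CPU time and command name."""
    completed = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "time=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout
```

On the worker host this could be called as `top_cpu_consumers(read_ps())` to see which ffmpeg processes have consumed the most CPU time.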
Out of scope
Further details
Always latest result in this scenario: latest
Updated by zcjia 9 months ago
https://openqa.suse.de/tests/13885428#step/zypper_patch/14 (attachment p5.png)
https://openqa.suse.de/tests/13885349#step/check_after_reboot/28 (attachment p6.png) should be "digit" instead of "diit".
https://openqa.suse.de/tests/13885464#step/setup/86 (attachment p7.png)
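To make the symptom concrete, here is a small sketch that compares the string a test intended to type with what actually appeared on screen and reports the dropped characters. It is only an illustration using Python's `difflib`; the helper name is hypothetical and the `echo $?` string is an assumed example of a command containing "$?", not taken from the failing jobs.

```python
from difflib import SequenceMatcher


def dropped_characters(expected: str, typed: str):
    """Return (index, text) pairs present in `expected` but missing in `typed`."""
    matcher = SequenceMatcher(None, expected, typed, autojunk=False)
    return [
        (i1, expected[i1:i2])
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag == "delete"
    ]


# "digit" vs "diit" is the case from p6.png; "echo $?" is an assumed command.
print(dropped_characters("digit", "diit"))
print(dropped_characters("echo $?", "echo ?"))
```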
Updated by okurz 9 months ago
- Related to action #157636: remove NOVIDEO=1 from ppc64le workers added
Updated by okurz 9 months ago · Edited
- Project changed from openQA Tests (public) to openQA Infrastructure (public)
- Category changed from Bugs in existing tests to Regressions/Crashes
- Status changed from New to In Progress
- Assignee set to okurz
- Priority changed from Normal to Urgent
- Target version set to Ready
I must say I am sorry that I did not act on this earlier. I had already seen the consistently high CPU load on mania for days but did not take action.
Tasks:
- Reduce the number of worker instances as a first mitigation measure. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759 (merged)
- Make the alert for CPU load more strict - #158113
- Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - #158116
- Check existing ffmpeg processes on mania which take a lot of CPU time - #158116
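For the stricter CPU load alert in the task list above, one possible rule is to fire only when the load per CPU stays above a threshold for several consecutive samples. The sketch below is independent of any monitoring stack; the function name, the per-CPU threshold and the sample count are assumptions chosen for illustration.

```python
from typing import Sequence


def load_alert(
    samples: Sequence[float],
    cpu_count: int,
    threshold_per_cpu: float = 1.0,
    min_consecutive: int = 3,
) -> bool:
    """Return True if the most recent `min_consecutive` load samples
    all exceed `threshold_per_cpu` after dividing by `cpu_count`."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(sample / cpu_count > threshold_per_cpu for sample in recent)
```

Making the alert "more strict" then amounts to lowering `threshold_per_cpu` or `min_consecutive`.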
Updated by okurz 9 months ago
- Copied to action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M added
Updated by okurz 9 months ago
- Copied to action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M added
Updated by okurz 9 months ago
- Copied to action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M added
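The idea behind #158125 can be sketched as a simple gate that a worker evaluates before accepting a job. This is not the openQA implementation, only an illustration under the assumption that the 1-minute system load average is the deciding metric; `may_pick_up_job` and its parameters are hypothetical names.

```python
import os
from typing import Optional


def may_pick_up_job(load_threshold: float, load_average: Optional[float] = None) -> bool:
    """Return True only if the load average is below the configured threshold.

    If `load_average` is not given, the current 1-minute load average
    of the host is read via os.getloadavg().
    """
    if load_average is None:
        load_average = os.getloadavg()[0]
    return load_average < load_threshold
```

A worker would call `may_pick_up_job(threshold)` before starting a job and retry later when it returns False.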
Updated by llzhao 9 months ago · Edited
- File Screenshot from 2024-03-28 14-37-54.png added
- File Screenshot from 2024-03-28 14-37-43.png added
- File deleted (p1.png)
There are still performance issues, for example:
Missing characters when typing with type_string():
https://openqa.suse.de/tests/13897020#step/drbd_passive/30
https://openqa.suse.de/tests/13896962#step/drbd_passive/26
https://openqa.suse.de/tests/13897051#step/hostname/32
Stall detected:
https://openqa.suse.de/tests/13897045#step/patch_sle/57
Updated by okurz 9 months ago
- Due date deleted (2024-04-10)
- Status changed from Feedback to Resolved
Checked history of mania:1 from https://openqa.suse.de/admin/workers/3442 and I see many ok jobs and a lot of failures but no obvious typing issues anymore. Follow-up tasks identified and reported in separate tickets.
Updated by openqa_review 7 months ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: migration_offline_scc_sle15sp4_ha_alpha_node02
https://openqa.suse.de/tests/14253478#step/system_prepare/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by okurz 7 months ago
- Status changed from Feedback to Resolved
https://openqa.suse.de/tests/14253478#step/system_prepare/28 has nothing to do with the original issue; it is caused by some service taking too long.
openqa-query-for-job-label 158104
shows that only this job references the ticket. I removed the reference.