action #158104

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #158110: [epic] Prevent worker overload

typing issue on ppc64 worker size:S

Added by zcjia 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-03-27
Due date:
% Done:

0%

Estimated time:

Description

Observation

openQA test in scenario sle-15-SP6-Online-ppc64le-ha_beta_supportserver@ppc64le-2g fails in setup

https://openqa.suse.de/tests/13885455#step/setup/84 (see attachment p1.png)

https://openqa.suse.de/tests/13885471#step/setup/30 (see attachment p2.png). The typed command is missing the "$" before "?".

https://openqa.suse.de/tests/13885404#step/setup/12 (see attachment p3.png)

https://openqa.suse.de/tests/13885407#step/setup/9 (see attachment p4.png)

I think this may be related to the high workload of the underlying ppc64 worker.

All affected jobs ran on "mania".

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

Fails since (at least) Build 73.1 (current job)

Expected result

Last good: 67.1 (or more recent)

Suggestions

  • Identify the affected machines and workers, apply mitigations to prevent recurring typing issues, e.g. reducing CPU load
  • Restart related failed jobs
  • Identify follow-up tasks
  • Reduce the number of worker instances as a first mitigation measure. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759 (merged)
  • Make the alert for CPU load more strict - #158113
  • Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - #158116
  • Check existing ffmpeg processes on mania which take a lot of CPU time - #158116
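
The load-related suggestions above (a stricter CPU load alert, and the follow-up idea in #158125 of only starting jobs while the load is below a threshold) come down to comparing the host's load average with its CPU count. A minimal sketch in Python, assuming a per-CPU threshold of 1.0; the function names and the threshold value are hypothetical and are not taken from openQA or its monitoring configuration:

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs on the host."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def should_accept_jobs(load_avg: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """Return True while the normalized load is below the threshold.

    The default threshold of 1.0 per CPU is an assumption for illustration.
    """
    return load_per_cpu(load_avg, cpu_count) < threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load per CPU (15 min): {load_per_cpu(fifteen_min, cpus):.2f}")
    print(f"accept new jobs: {should_accept_jobs(fifteen_min, cpus)}")
```

Which load window (1, 5 or 15 minutes) and which threshold actually prevent typing issues on a host like mania would have to be determined by observation.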

Out of scope

Further details

Always latest result in this scenario: latest


Files

p2.png (53.3 KB), zcjia, 2024-03-27 06:52
p3.png (33.5 KB), zcjia, 2024-03-27 06:56
p4.png (31 KB), zcjia, 2024-03-27 06:57
p5.png (58.9 KB), zcjia, 2024-03-27 07:04
p6.png (28.7 KB), zcjia, 2024-03-27 07:07
p7.png (28.8 KB), zcjia, 2024-03-27 07:09
Screenshot from 2024-03-28 14-37-54.png (151 KB), llzhao, 2024-03-28 06:38
Screenshot from 2024-03-28 14-37-43.png (109 KB), llzhao, 2024-03-28 06:38

Related issues 4 (2 open, 2 closed)

Related to openQA Infrastructure - action #157636: remove NOVIDEO=1 from ppc64le workers (New, zcjia, 2024-03-21)

Copied to openQA Infrastructure - action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M (Resolved, okurz, 2024-03-27)

Copied to openQA Infrastructure - action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M (Workable, 2024-03-27)

Copied to openQA Project - action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M (Resolved, mkittler)

Actions #2

Updated by okurz 3 months ago

  • Related to action #157636: remove NOVIDEO=1 from ppc64le workers added
Actions #3

Updated by okurz 3 months ago · Edited

  • Project changed from openQA Tests to openQA Infrastructure
  • Category changed from Bugs in existing tests to Regressions/Crashes
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to Urgent
  • Target version set to Ready

I must say I am sorry I did not act on this earlier. I had already seen the consistently high CPU load on mania for days but did not take action.

Tasks:

  1. Reduce the number of worker instances as a first mitigation measure. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759 (merged)
  2. Make the alert for CPU load more strict - #158113
  3. Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - #158116
  4. Check existing ffmpeg processes on mania which take a lot of CPU time - #158116
Actions #4

Updated by okurz 3 months ago

  • Copied to action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M added
Actions #5

Updated by okurz 3 months ago

  • Parent task set to #158110
Actions #6

Updated by okurz 3 months ago

  • Copied to action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M added
Actions #7

Updated by okurz 3 months ago

  • Description updated (diff)
Actions #8

Updated by okurz 3 months ago · Edited

Called:

host=openqa.suse.de WORKER=mania failed_since=2024-03-25 result="result='failed'" comment="label:poo#158104" openqa-advanced-retrigger-jobs | grep -c 'jobs/.*/restart'
Actions #9

Updated by okurz 3 months ago

  • Copied to action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M added
Actions #10

Updated by okurz 3 months ago

  • Due date set to 2024-04-10
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High
Actions #12

Updated by okurz 3 months ago

  • Subject changed from typing issue on ppc64 worker to typing issue on ppc64 worker size:S
  • Description updated (diff)
Actions #13

Updated by okurz 2 months ago

  • Due date deleted (2024-04-10)
  • Status changed from Feedback to Resolved

Checked history of mania:1 from https://openqa.suse.de/admin/workers/3442 and I see many ok jobs and a lot of failures but no obvious typing issues anymore. Follow-up tasks identified and reported in separate tickets.

Actions #14

Updated by openqa_review about 1 month ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: migration_offline_scc_sle15sp4_ha_alpha_node02
https://openqa.suse.de/tests/14253478#step/system_prepare/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #15

Updated by okurz about 1 month ago

  • Status changed from Feedback to Resolved

https://openqa.suse.de/tests/14253478#step/system_prepare/28 is unrelated to the original issue; the failure there is caused by a service taking too long.

openqa-query-for-job-label 158104 shows that only this job references the ticket. I removed the reference.
