action #158104

closed

openQA Project - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project - coordination #158110: [epic] Prevent worker overload

typing issue on ppc64 worker size:S

Added by zcjia about 1 month ago. Updated 24 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-03-27
Due date:
% Done:

0%

Estimated time:

Description

Observation

openQA test in scenario sle-15-SP6-Online-ppc64le-ha_beta_supportserver@ppc64le-2g fails in setup

https://openqa.suse.de/tests/13885455#step/setup/84 (see attachment p1.png)

https://openqa.suse.de/tests/13885471#step/setup/30 (see attachment p2.png). The "$" before "?" was not typed.

https://openqa.suse.de/tests/13885404#step/setup/12 (see attachment p3.png)

https://openqa.suse.de/tests/13885407#step/setup/9 (see attachment p4.png)

I think this may be related to the high workload of the underlying ppc64 worker.

All of these jobs ran on "mania".

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

Fails since (at least) Build 73.1 (current job)

Expected result

Last good: 67.1 (or more recent)

Suggestions

  • Identify the affected machines and workers, apply mitigations to prevent recurring typing issues, e.g. reducing CPU load
  • Restart related failed jobs
  • Identify follow-up tasks
  • Reduce the number of worker instances as a first mitigation measure. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759 (merged)
  • Make the alert for CPU load more strict - #158113
  • Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - #158116
  • Check existing ffmpeg processes on mania which take a lot of CPU time - #158116
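
The idea behind the load-related suggestions above can be illustrated with a minimal sketch that relates the 1-minute load average to the number of CPUs. This is not how openQA or its alerting implements anything; the function names and the threshold of 0.8 per CPU are assumptions made up for illustration.

```python
import os


def load_per_cpu(load_1min: float, cpu_count: int) -> float:
    """Return the 1-minute load average normalized by the CPU count."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_1min / cpu_count


def host_is_overloaded(load_1min: float, cpu_count: int,
                       threshold: float = 0.8) -> bool:
    """True if the normalized load reaches the (hypothetical) threshold."""
    return load_per_cpu(load_1min, cpu_count) >= threshold


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    state = "overloaded" if host_is_overloaded(load_1min, cpus) else "ok"
    print(f"load/cpu = {load_per_cpu(load_1min, cpus):.2f} ({state})")
```

Running the script on a worker host prints the normalized load, which could serve as a quick manual check before adjusting the number of worker instances.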

Out of scope

Further details

Always latest result in this scenario: latest


Files

p2.png (53.3 KB) zcjia, 2024-03-27 06:52
p3.png (33.5 KB) zcjia, 2024-03-27 06:56
p4.png (31 KB) zcjia, 2024-03-27 06:57
p5.png (58.9 KB) zcjia, 2024-03-27 07:04
p6.png (28.7 KB) zcjia, 2024-03-27 07:07
p7.png (28.8 KB) zcjia, 2024-03-27 07:09
Screenshot from 2024-03-28 14-37-54.png (151 KB) llzhao, 2024-03-28 06:38
Screenshot from 2024-03-28 14-37-43.png (109 KB) llzhao, 2024-03-28 06:38

Related issues 4 (2 open, 2 closed)

Related to openQA Infrastructure - action #157636: remove NOVIDEO=1 from ppc64le workers (New, zcjia, 2024-03-21)

Copied to openQA Infrastructure - action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M (Resolved, okurz, 2024-03-27)

Copied to openQA Infrastructure - action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M (Workable, 2024-03-27)

Copied to openQA Project - action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M (Resolved, mkittler)

Actions #2

Updated by okurz about 1 month ago

  • Related to action #157636: remove NOVIDEO=1 from ppc64le workers added
Actions #3

Updated by okurz about 1 month ago · Edited

  • Project changed from openQA Tests to openQA Infrastructure
  • Category changed from Bugs in existing tests to Regressions/Crashes
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to Urgent
  • Target version set to Ready

I must say I am sorry I did not act earlier on this. I had already seen the consistently high CPU load on mania for days but did not take action.

Tasks:

  1. Reduce the number of worker instances as a first mitigation measure. https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/759 (merged)
  2. Make the alert for CPU load more strict - #158113
  3. Evaluate the impact on video encoding in particular on ppc64le, maybe ffmpeg on Power8 kvm is inefficient - #158116
  4. Check existing ffmpeg processes on mania which take a lot of CPU time - #158116
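
For task 4, one way to spot long-running encoder processes is to rank them by accumulated CPU time. The following is a minimal sketch that parses text in the format produced by `ps -eo pid,comm,time`, where the time column has the form `[DD-]HH:MM:SS`. The helper names and the sample data are hypothetical and not taken from mania.

```python
def parse_cpu_time(value: str) -> int:
    """Convert a ps CPU time string '[DD-]HH:MM:SS' to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def busiest_processes(ps_output: str, name: str = "ffmpeg"):
    """Return (pid, cpu_seconds) for processes called `name`, busiest first."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header line and any line that is not "PID COMMAND TIME".
        if len(fields) != 3 or fields[1] != name:
            continue
        found.append((int(fields[0]), parse_cpu_time(fields[2])))
    return sorted(found, key=lambda item: item[1], reverse=True)


# Made-up sample in the layout of `ps -eo pid,comm,time`.
SAMPLE = """  PID COMMAND          TIME
 1201 ffmpeg           00:42:10
 1302 qemu-system-ppc  1-02:03:04
 1410 ffmpeg           01:00:05
"""

if __name__ == "__main__":
    for pid, seconds in busiest_processes(SAMPLE):
        print(pid, seconds)
```

On a real host the sample string would be replaced by the actual output of `ps -eo pid,comm,time`.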
Actions #4

Updated by okurz about 1 month ago

  • Copied to action #158113: typing issue on ppc64 worker - make CPU load alert more strict size:M added
Actions #5

Updated by okurz about 1 month ago

  • Parent task set to #158110
Actions #6

Updated by okurz about 1 month ago

  • Copied to action #158116: typing issue on ppc64 worker - crosscheck performance impact of ffmpeg on ppc64le (Power8 kvm) size:M added
Actions #7

Updated by okurz about 1 month ago

  • Description updated (diff)
Actions #8

Updated by okurz about 1 month ago · Edited

Called:

host=openqa.suse.de WORKER=mania failed_since=2024-03-25 result="result='failed'" comment="label:poo#158104" openqa-advanced-retrigger-jobs | grep -c 'jobs/.*/restart'
Actions #9

Updated by okurz about 1 month ago

  • Copied to action #158125: typing issue on ppc64 worker - only pick up (or start) new jobs if CPU load is below configured threshold size:M added
Actions #10

Updated by okurz about 1 month ago

  • Due date set to 2024-04-10
  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High
Actions #12

Updated by okurz about 1 month ago

  • Subject changed from typing issue on ppc64 worker to typing issue on ppc64 worker size:S
  • Description updated (diff)
Actions #13

Updated by okurz 24 days ago

  • Due date deleted (2024-04-10)
  • Status changed from Feedback to Resolved

Checked the history of mania:1 at https://openqa.suse.de/admin/workers/3442: I see many ok jobs and a lot of failures, but no obvious typing issues anymore. Follow-up tasks have been identified and reported in separate tickets.
