action #33529: isotovideo: backend takes 100 % of CPU when driving svirt job - openQA Project (public) - openSUSE Project Management Tool

closed

Added by michalnowak about 7 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Regressions/Crashes
Target version: Done
Start date: 2018-03-20

Description

The isotovideo backend takes 100% of CPU when it drives an svirt job. It does not when it drives a qemu job.

Updated by dasantiago about 7 years ago

Does this happen in some specific part of the test, like on a restart, or is it always at 100% CPU?

Updated by michalnowak about 7 years ago

On both Xen HVM and Hyper-V it happens at the end of bootloader_svirt / at the beginning of bootloader_uefi. That boundary is where the console switches from svirt to sut.

Updated by dasantiago about 7 years ago

michalnowak wrote:

On both Xen HVM and Hyper-V it happens at the end of bootloader_svirt / at the beginning of bootloader_uefi. That boundary is where the console switches from svirt to sut.

Then it looks like it's because of the polling of the serial console. Don't you agree? Or doesn't the CPU usage stabilize after that?

Updated by michalnowak about 7 years ago

Looks like it. I tracked it to define_and_start.

Updated by michalnowak almost 7 years ago

Perhaps the 100% CPU utilization undermines the shared belief that two svirt workers can replace one qemu worker? That should still be true for disk I/O, but CPU time is probably affected greatly. Also, running more than two svirt jobs on a laptop makes the fan go crazy.

Updated by coolo almost 6 years ago

This is still the case, right?

Updated by michalnowak almost 6 years ago

Yes, it is.

Updated by coolo almost 6 years ago

Priority changed from Normal to High
Target version set to Ready

Updated by okurz almost 6 years ago

Category changed from 132 to Feature requests

#10

Updated by mkittler over 5 years ago

Status changed from New to In Progress
Assignee set to mkittler
Target version changed from Ready to Current Sprint

@dasantiago is right. I've added some debug printing in the relevant functions in baseclass.pm to confirm the theory. There is also already a related warning visible in the log:

Calling Net::SSH2::Channel::readline in non-blocking mode is usually a programming error at /hdd/openqa-devel/repos/os-autoinst/backend/baseclass.pm line 1225.

It likely can't be made blocking without impairing the backend's responsiveness. I have to dig into the backend code to find a solution. It might not be trivial.

#11

Updated by mkittler over 5 years ago

The code actually uses IO::Select to only read from the SSH channel when the underlying socket is ready to read. But apparently that's not sufficient. The socket appears to be always ready to read although reading from the SSH channel mostly results in the error "operation would block".

I changed the code from reading line by line to using Net::SSH2::Channel::read2 so the extended data would be consumed as well. However, that doesn't change a thing.

So I'm not sure how to integrate Net::SSH2::Channel into our async processing.
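
The "operation would block" behaviour described above is not specific to Net::SSH2. As a minimal sketch in Python (not the Perl code from baseclass.pm; the helper name `try_read` and the local socket pair standing in for the SSH connection are assumptions made for illustration), a non-blocking read on a socket with no pending data fails immediately instead of waiting:

```python
import select
import socket


def try_read(sock, size=4096):
    """Return the bytes read, or None if the read would block."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: no data is pending on the non-blocking socket.
        return None


# A local socket pair stands in for the SSH connection (assumption).
reader, writer = socket.socketpair()
reader.setblocking(False)

first = try_read(reader)  # nothing has been sent yet

writer.sendall(b"login: ")
select.select([reader], [], [], 1.0)  # wait until the data is readable
second = try_read(reader)

reader.close()
writer.close()
```

A loop that performs such a read every time the socket is reported as ready will spin whenever that readiness report does not reflect pending data, which matches the symptom seen here.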

#12

Updated by coolo over 5 years ago

Category changed from Feature requests to Regressions/Crashes

#13

Updated by mkittler over 5 years ago

Apparently the SSH socket was just passed to the write FDs for IO::Select. This PR attempts to fix it: https://github.com/os-autoinst/os-autoinst/pull/1239

It actually decreases the CPU usage to almost nothing. However, it seems to break other things (or it is just my local setup).
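
To illustrate why a socket in the write set keeps the loop busy, here is a minimal sketch in Python that uses `select.select` as a stand-in for IO::Select (the function `count_wakeups` and the local socket pair are hypothetical and not taken from os-autoinst). An idle connected socket is almost always writable, so `select` returns immediately when the socket is in the write set, whereas in the read set it sleeps until data arrives or the timeout expires:

```python
import select
import socket
import time


def count_wakeups(read_fds, write_fds, duration=0.2, timeout=0.05):
    """Count how often select() returns within `duration` seconds."""
    wakeups = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        select.select(read_fds, write_fds, [], timeout)
        wakeups += 1
    return wakeups


# A local socket pair stands in for the SSH connection (assumption).
sock, peer = socket.socketpair()

busy = count_wakeups([], [sock])  # socket watched for writability
idle = count_wakeups([sock], [])  # socket watched for readability, no data sent

sock.close()
peer.close()
```

With the socket in the write set the loop wakes up as fast as the CPU allows. With the socket in the read set it wakes up only once per timeout, which is consistent with the drop in CPU usage reported for the PR.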

#14

Updated by mkittler over 5 years ago

Status changed from In Progress to Feedback

The PR has been merged but not deployed on all relevant production workers.

#15

Updated by mkittler over 5 years ago

Status changed from Feedback to Resolved
Target version changed from Current Sprint to Done

I've just had a look at the CPU usage on openqaworker2. It runs a few svirt jobs but none of the cores is constantly busy.

The change likely caused a regression. There's another ticket for it so I'll close this one.
