action #93112

[qe-core][s390x] bootloader_zkvm fails: Cannot allocate memory

Added by MDoucha 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Enhancement to existing tests
Target version: -
Start date: 2021-05-25
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

s390 jobs randomly fail in bootloader_zkvm. autoinst-log.txt shows the following error:

[debug] [run_ssh_cmd(virsh  start openQA-SUT-4 2> >(tee /tmp/os-autoinst-openQA-SUT-4-stderr.log >&2))] stderr:
error: Failed to start domain openQA-SUT-4
error: internal error: qemu unexpectedly closed the monitor: 2021-05-18T11:23:21.183643Z qemu-system-s390x: cannot set up guest memory 's390.ram': Cannot allocate memory

https://openqa.suse.de/tests/6044126#step/bootloader_zkvm/28
https://openqa.suse.de/tests/6044006#step/bootloader_zkvm/28

This appears to be the same problem as #45326 and #48404.

Additional links: latest job with bootloader_zkvm

History

#1 Updated by okurz 5 months ago

  • Subject changed from [s390] bootloader_zkvm fails: Cannot allocate memory to [qe-core][s390x] bootloader_zkvm fails: Cannot allocate memory
  • Category set to Enhancement to existing tests

Thank you for reporting a ticket. I am convinced that this way we can follow up on individual issues much better than with just comments on a confluence page (which also has multiple other drawbacks).

  1. MDoucha if you open a ticket with a keyword in the subject line but without a team tag then it is likely to be overlooked, see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging
  2. I am curious. Why did you not use the "Report test issue" button on openQA to create this ticket? This would have included other helpful information in the ticket, e.g. a link to "latest" of a scenario where the problem was observed so that ticket reviewers can get more context and get an impression of failure statistics
  3. For the immediate issue, retriggering can increase the likelihood of success when it's only a temporary issue. For this I recommend using "auto-review", see https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger . If you need help adjusting the ticket accordingly, please speak up and I can show you here, as an example, how it works.
  4. As explained on https://confluence.suse.com/display/~vpelcak/Draft+-+Change+in+openQA+QE+Maintenance+Review both os-autoinst-distri-opensuse in general and the specific backend implementations in os-autoinst, except more generic ones like qemu, are out of scope for SUSE QE Tools. Here, however, I feel like we might be able to come up with a more generic improvement. We already have the concept of "assigning" jobs to workers, where the worker is able to give back jobs to the openQA scheduler in case the worker is not able to execute the assigned job. Maybe it would be possible to abort a test run early without ending up with an incomplete job, keeping the job in the openQA schedule and trying again. This of course cannot fix the problem that not enough memory is available on the backend host. I recorded the idea in #65271.

#2 Updated by MDoucha 5 months ago

  • Description updated (diff)

okurz wrote:

Thank you for reporting a ticket. I am convinced that this way we can follow up on individual issues much better than with just comments on a confluence page (which also has multiple other drawbacks).

And I'm convinced that bootloader failures are something that should be monitored by the Tools team through a dashboard or direct database query and handled proactively. Never mind that this error has been reported and fixed before and it is most likely the result of worker overload.

  1. MDoucha if you open a ticket with a keyword in the subject line but without a team tag then it is likely to be overlooked, see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging

How, pray tell, am I supposed to know which team this ticket should be assigned to? That is to be decided in triage.

  1. I am curious. Why did you not use the "Report test issue" button on openQA to create this ticket? This would have included other helpful information in the ticket, e.g. a link to "latest" of a scenario where the problem was observed so that ticket reviewers can get more context and get an impression of failure statistics

Because that "other helpful information" is nothing but useless clutter. But sure, let me add a link to the latest 15-SP2 install_ltp job to the ticket description. If you think it'll provide any insight into an issue that spans a dozen job groups, multiple flavors and about a hundred different testsuites that you won't see in the "Next & previous results" tab.

  1. For the immediate issue, retriggering can increase the likelihood of success when it's only a temporary issue. For this I recommend using "auto-review", see https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger . If you need help adjusting the ticket accordingly, please speak up and I can show you here, as an example, how it works.

I have a better idea: Fix the problem.

#3 Updated by mgriessmeier 5 months ago

for reference: the s390 LPAR has been upgraded from 12GB to 24GB RAM a few minutes ago

#4 Updated by okurz 4 months ago

MDoucha wrote:

okurz wrote:

Thank you for reporting a ticket. I am convinced that this way we can follow up on individual issues much better than with just comments on a confluence page (which also has multiple other drawbacks).

And I'm convinced that bootloader failures are something that should be monitored by the Tools team through a dashboard or direct database query and handled proactively. Never mind that this error has been reported and fixed before and it is most likely the result of worker overload.

Well, I understand how people could expect that but this is not how it currently works. I understand how it can be frustrating if one thinks "this error is happening so often, why is no one fixing that?". Often there is a simple reason: Because no one else looked there. I was and I am referring to https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope not to piss off people but to make sure we have the same understanding. All except one member of the SUSE QE Tools team have been assigned to the team to do "software development only" and I am even forcing them to take more responsibilities, e.g. as described on https://progress.opensuse.org/projects/qa/wiki/Wiki#Common-tasks-for-team-members to do much more and e.g. monitor our infrastructure. But of course feel free to talk to everybody from the team to get to know them better.

Regarding what the Tools team monitors: For example, based on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 we monitor the overall schedule of jobs, whether jobs are started within a reasonable time, and the number of incomplete jobs based on the specific "reason".

So in general we do not look at "failed" jobs to review them.

  1. MDoucha if you open a ticket with a keyword in the subject line but without a team tag then it is likely to be overlooked, see https://progress.opensuse.org/projects/openqatests/wiki#ticket-backlog-triaging

How, pray tell, am I supposed to know which team this ticket should be assigned to? That is to be decided in triage.

My point was: Because you added a keyword right away the ticket would not show up in the query "subject without [" anymore hence nobody would "triage" it into a team because mostly nobody looks into tickets which have a "[" in the subject line but might not match one of the existing team's queries.

  1. I am curious. Why did you not use the "Report test issue" button on openQA to create this ticket? This would have included other helpful information in the ticket, e.g. a link to "latest" of a scenario where the problem was observed so that ticket reviewers can get more context and get an impression of failure statistics

Because that "other helpful information" is nothing but useless clutter. But sure, let me add a link to the latest 15-SP2 install_ltp job to the ticket description. If you think it'll provide any insight into an issue that spans a dozen job groups, multiple flavors and about a hundred different testsuites that you won't see in the "Next & previous results" tab.

My experience tells me that many people, when looking at a ticket with single job URLs pointing to jobs that might have been deleted already at that time simply reject the ticket and look no further assuming that "probably the problem has gone away". You can also provide a section "how to reproduce" or "how to find more jobs".

  1. For the immediate issue, retriggering can increase the likelihood of success when it's only a temporary issue. For this I recommend using "auto-review", see https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger . If you need help adjusting the ticket accordingly, please speak up and I can show you here, as an example, how it works.

I have a better idea: Fix the problem.

I was providing something that could have helped you and others in the meantime while this issue is not fixed. And as you can see the issue is already 17 days old and not much has happened. But I guess you just want to mock me.

#5 Updated by MDoucha 4 months ago

okurz wrote:

MDoucha wrote:

And I'm convinced that bootloader failures are something that should be monitored by the Tools team through a dashboard or direct database query and handled proactively. Never mind that this error has been reported and fixed before and it is most likely the result of worker overload.

Well, I understand how people could expect that but this is not how it currently works. I understand how it can be frustrating if one thinks "this error is happening so often, why is no one fixing that?". Often there is a simple reason: Because no one else looked there. I was and I am referring to https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope not to piss off people but to make sure we have the same understanding. All except one member of the SUSE QE Tools team have been assigned to the team to do "software development only" and I am even forcing them to take more responsibilities, e.g. as described on https://progress.opensuse.org/projects/qa/wiki/Wiki#Common-tasks-for-team-members to do much more and e.g. monitor our infrastructure. But of course feel free to talk to everybody from the team to get to know them better.

Regarding what the Tools team monitors: For example, based on https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1 we monitor the overall schedule of jobs, whether jobs are started within a reasonable time, and the number of incomplete jobs based on the specific "reason".

So in general we do not look at "failed" jobs to review them.

From my point of view, the Tools team's responsibility over OpenQA ends when you hand me a fully booted and logged in system where I can run my tests. Finding all boot failures over the past 3 months takes about 5 lines of SQL, so nobody on your team needs to monitor all jobs. Just run a script once a month or two and look at the dozen jobs it prints to see whether there's a bug that can be fixed. Fixing bugs is part of software development.
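A minimal sketch of the kind of query meant here, not the actual script. It assumes a simplified, hypothetical schema with a `jobs` table and a `job_modules` table and runs against an in-memory SQLite database with made-up sample rows; the real openQA database is PostgreSQL and its table and column names may differ.

```python
import sqlite3

# Hypothetical, simplified stand-in for the openQA schema (assumption):
# the real database is PostgreSQL and has many more columns.
SCHEMA = """
CREATE TABLE jobs (id INTEGER PRIMARY KEY, arch TEXT, t_finished TEXT);
CREATE TABLE job_modules (job_id INTEGER, name TEXT, result TEXT);
"""

# Jobs in which the given test module failed since a cutoff date.
QUERY = """
SELECT j.id, j.t_finished
FROM jobs j
JOIN job_modules m ON m.job_id = j.id
WHERE m.name = ? AND m.result = 'failed' AND j.t_finished >= ?
ORDER BY j.t_finished
"""


def find_boot_failures(conn, module, since):
    """Return (job id, finish time) for every job where `module` failed."""
    return conn.execute(QUERY, (module, since)).fetchall()


def demo():
    """Run the query against made-up sample rows (not real job data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?)", [
        (1, "s390x", "2021-05-18 11:23:21"),
        (2, "s390x", "2021-05-19 08:00:00"),
        (3, "s390x", "2021-01-02 09:00:00"),
    ])
    conn.executemany("INSERT INTO job_modules VALUES (?, ?, ?)", [
        (1, "bootloader_zkvm", "failed"),
        (2, "bootloader_zkvm", "passed"),
        (3, "bootloader_zkvm", "failed"),
    ])
    return find_boot_failures(conn, "bootloader_zkvm", "2021-03-01")
```

Calling `demo()` lists only the sample job that failed in `bootloader_zkvm` after the cutoff date; the ISO-formatted timestamps are compared as plain strings, which is enough for this illustration.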

I was and I am referring to https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope not to piss off people but to make sure we have the same understanding.

What do you expect when your first reaction to a bug report is posting a link that essentially says "that's not my problem"?

My point was: Because you added a keyword right away the ticket would not show up in the query "subject without [" anymore hence nobody would "triage" it into a team because mostly nobody looks into tickets which have a "[" in the subject line but might not match one of the existing team's queries.

May I suggest using the "Target version" field to track triage status instead?

My experience tells me that many people, when looking at a ticket with single job URLs pointing to jobs that might have been deleted already at that time simply reject the ticket and look no further assuming that "probably the problem has gone away". You can also provide a section "how to reproduce" or "how to find more jobs".

I find it hard to imagine that somebody will click a link to "latest job in group", find out that it has passed without error and reach a different conclusion than "probably the problem has gone away". This issue is not reproducible by the usual means. There is currently no way to find more jobs with the same failure other than running a custom SQL query on the OpenQA database.

I have a better idea: Fix the problem.

I was providing something that could have helped you and others in the meantime while this issue is not fixed. And as you can see the issue is already 17 days old and not much has happened. But I guess you just want to mock me.

I came here to report an OpenQA issue so that it can be fixed. If I wanted to sweep the problem under a rug, I would have come up with some way to do that myself, thankyouverymuch. Don't waste my time with things that won't help fix the problem and I won't feel the urge to be sarcastic in response. Focus on the issue at hand and we'll both be happy.

Anyway, I haven't seen this issue in OpenQA since the LPAR memory upgrade mentioned by Matthias above, though there haven't been that many kernel jobs since then. We have a new batch of livepatches now so I'll do another scan for errors when it finishes and if there are no bootloader_zkvm failures, I'll close this ticket as fixed.

#6 Updated by okurz 4 months ago

MDoucha wrote:

[…]
From my point of view, the Tools team's responsibility over OpenQA ends when you hand me a fully booted and logged in system where I can run my tests. Finding all boot failures over the past 3 months takes about 5 lines of SQL, so nobody on your team needs to monitor all jobs. Just run a script once a month or two and look at the dozen jobs it prints to see whether there's a bug that can be fixed. Fixing bugs is part of software development.

Regardless of who will do that the specific suggestion is good.

I was and I am referring to https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope not to piss off people but to make sure we have the same understanding.

What do you expect when your first reaction to a bug report is posting a link that essentially says "that's not my problem"?

I expect for the reader to look for the right team that has it in scope then.

My point was: Because you added a keyword right away the ticket would not show up in the query "subject without [" anymore hence nobody would "triage" it into a team because mostly nobody looks into tickets which have a "[" in the subject line but might not match one of the existing team's queries.

May I suggest using the "Target version" field to track triage status instead?

Well, that's what we within the Tools team already do.

[…]

I have a better idea: Fix the problem.

I was providing something that could have helped you and others in the meantime while this issue is not fixed. And as you can see the issue is already 17 days old and not much has happened. But I guess you just want to mock me.

I came here to report an OpenQA issue so that it can be fixed. If I wanted to sweep the problem under a rug, I would have come up with some way to do that myself, thankyouverymuch. Don't waste my time with things that won't help fix the problem and I won't feel the urge to be sarcastic in response. Focus on the issue at hand and we'll both be happy.

Anyway, I haven't seen this issue in OpenQA since the LPAR memory upgrade mentioned by Matthias above, though there haven't been that many kernel jobs since then. We have a new batch of livepatches now so I'll do another scan for errors when it finishes and if there are no bootloader_zkvm failures, I'll close this ticket as fixed.

The problem can simply be triggered by anyone who triggers a job with an artificially high memory configuration value or spawns any combination of jobs whose configured memory exceeds the available memory. This is specific to the s390x KVM implementation. For other backends the jobs will end up incomplete and not even start to display any results. And incompletes are monitored and handled by SUSE QE Tools. Here instead the jobs start and fail, i.e. end up "failed", not incomplete as they should.
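The condition described above can be sketched as a simple sum check. This is only an illustration: the function name and the 4096 MiB guest size are hypothetical, and the check ignores memory overcommit, swap, and the host's own usage, all of which affect when qemu actually fails to allocate guest memory.

```python
def can_start(host_mib, running_mib, new_guest_mib):
    """Return True if the new guest's configured memory still fits on the host.

    Simplifying assumption: the configured memory of all guests must not
    exceed the host's total memory.
    """
    return sum(running_mib) + new_guest_mib <= host_mib


# Hypothetical guest size; the 12 GB and 24 GB host sizes are the LPAR
# values mentioned in comment #3.
GUEST_MIB = 4096
HOST_BEFORE_MIB = 12 * 1024
HOST_AFTER_MIB = 24 * 1024

# With three such guests already running, a fourth no longer fits in 12 GB
# but does fit after the upgrade to 24 GB.
fits_before = can_start(HOST_BEFORE_MIB, [GUEST_MIB] * 3, GUEST_MIB)
fits_after = can_start(HOST_AFTER_MIB, [GUEST_MIB] * 3, GUEST_MIB)
```

A check like this could in principle run before `virsh start`, so that a job which cannot get its memory is handed back to the scheduler instead of ending up "failed".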

#7 Updated by MDoucha 4 months ago

  • Status changed from New to Resolved

There were no bootloader_zkvm failures in the last batch of livepatches, closing as fixed.
