action #37782

[kernel][functional][u][medium] test fails in execute_test_run because it cannot handle broken pipes

Added by nicksinger over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2018-06-25
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP4-Server-DVD-s390x-fs_stress@s390x-kvm-sle12 fails in
execute_test_run due to a broken pipe.

Suggestions to improve this test

To me this issue looks like a timeout after copying many files around. IMHO this can always happen if we rely on long-lived open TCP sessions.
Without looking into test_fs_stress-run, I'd assume it uses SSH. If so, one could try to increase the SSH timeout value (https://askubuntu.com/questions/127369/how-to-prevent-write-failed-broken-pipe-on-ssh-connection).
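If the test does connect over SSH, client-side keepalives are one way to stop a long-running session from going idle. The following is a minimal sketch, assuming an OpenSSH client; the helper name `ssh_command`, the example host, and the interval values are illustrative and not taken from the test code.

```python
def ssh_command(host, remote_cmd, interval=60, count_max=5):
    """Build an ssh argument list with client-side keepalives enabled.

    ServerAliveInterval and ServerAliveCountMax are standard OpenSSH
    client options; the values used here are illustrative assumptions.
    """
    return [
        "ssh",
        "-o", f"ServerAliveInterval={interval}",
        "-o", f"ServerAliveCountMax={count_max}",
        host,
        remote_cmd,
    ]


if __name__ == "__main__":
    # Hypothetical host name, for illustration only.
    print(" ".join(ssh_command("root@sut.example", "uptime")))
```

Keepalives only help against idle timeouts; they would not help if the sshd process on the system under test is killed.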

Another idea would be to implement retries (e.g. only fail after 3 retries).
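The retry idea could look roughly like the sketch below. It is written in Python for illustration rather than against the openQA test API, and `run_with_retries` and its parameters are hypothetical names.

```python
import time


def run_with_retries(action, attempts=3, delay=0.0, retry_on=(OSError,)):
    """Call action() up to `attempts` times and re-raise the last error.

    BrokenPipeError and the ConnectionError family are subclasses of
    OSError, so the default covers a dropped connection.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # give up only after the final attempt
            time.sleep(delay)
```

For a stress test, each retry would repeat the full load, so the number of attempts and the delay would need to be chosen with the total job timeout in mind.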

Reproducible

So far it has failed only in build 0263, so it is probably sporadic.


Related issues

Related to openQA Tests - action #34012: [kernel] too generic test failure in "execute_test_run" for stress tests, was previously something more specific like "acceptance_fs_stress" (Resolved, 2018-03-29)

History

#1 Updated by nicksinger over 2 years ago

  • Subject changed from [functional][s390x][medium] test fails in execute_test_run because it cannot handle broken pipes to [functional][medium][u] test fails in execute_test_run because it cannot handle broken pipes

#2 Updated by okurz about 2 years ago

  • Related to action #34012: [kernel] too generic test failure in "execute_test_run" for stress tests, was previously something more specific like "acceptance_fs_stress" added

#3 Updated by okurz about 2 years ago

  • Subject changed from [functional][medium][u] test fails in execute_test_run because it cannot handle broken pipes to [kernel][functional][u][medium] test fails in execute_test_run because it cannot handle broken pipes
  • Assignee set to yosun

Hi yosun, as discussed in #34012 I assume you want to pick it up?

#4 Updated by yosun about 2 years ago

  • Assignee changed from yosun to okurz

Thanks for the info.
This failed with "packet_write_wait: Connection to 10.161.145.16 port 22: Broken pipe". I checked the serial log and it does not contain any oops or crash info. The failure happens while running "/usr/share/qa/tools/file_copy -j 4 -i 5 -s 5000", which means 4 parallel jobs, 5 iterations, copying 5000MB (5GB) of files each time. Under this much fs stress, and with this test KVM having only 912MB RAM, it is easy to run out of memory; the system then kills processes at random, and the ssh service or a related process can be among them.
The log just shows that the ssh connection was lost. This is neither a kernel issue nor a test issue but a random one. The solution is to give this poor s390x KVM more RAM or disk space to avoid or reduce this kind of issue. We have reported a related issue as a bug to the Lab team, but the Lab's developer didn't take it because resources are too limited.
In all, I suggest the tools team solve it by giving more resources to this KVM, or we just reject this kind of ticket.
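A check of the serial log for OOM kills or oopses, as described above, could be automated along these lines. This is only a sketch: the patterns are assumptions based on messages the Linux kernel typically prints, and `find_kernel_trouble` is a hypothetical helper name.

```python
import re

# Substrings the Linux kernel typically logs for OOM kills, oopses and panics.
# "Out of memory: Kill" matches both the "Kill process" and "Killed process"
# wordings used by different kernel versions.
TROUBLE = re.compile(r"invoked oom-killer|Out of memory: Kill|\bOops\b|Kernel panic")


def find_kernel_trouble(log_text):
    """Return the log lines that match any of the trouble patterns."""
    return [line for line in log_text.splitlines() if TROUBLE.search(line)]
```

An empty result does not prove that nothing was killed; it only means that none of the assumed patterns appeared in the captured log.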

#5 Updated by okurz about 2 years ago

  • Status changed from New to Feedback

yosun wrote:

We have reported a related issue as a bug to the Lab team, but the Lab's developer didn't take it because resources are too limited.

What "Lab's developer" are you referring to? Do you have a ticket for that?

In all, I suggest the tools team solve it by giving more resources to this KVM, or we just reject this kind of ticket.

I don't think this is related to the tools team because when we talk about KVM we just configure the machine accordingly. We could do that but I want to wait for your response first.

#6 Updated by okurz about 2 years ago

  • Target version set to future

#7 Updated by yosun about 2 years ago

I was wrong, it did not fail because of an OOM issue. I checked the test code and the log in https://openqa.suse.de/tests/1777070/file/autoinst-log.txt again and found that this test randomly fails because the following line times out after 90 seconds:
assert_script_run("tar cjf $tarball -C /var/log/qa/ctcs2 ls /var/log/qa/ctcs2/");

It only fails after the fs_stress test, and this line runs right after the test finishes. I guess that after fs stress the system needs more time to get enough space to create the log tarball. I think the solution is to add the following lines before the failing part:
if (get_var("QA_TESTSUITE") eq "fs_stress") {
    sleep 120;
}

#8 Updated by okurz about 2 years ago

I guess rather than this big sleep time we should wait for what we really need, e.g. look for free space and wait until there is more free space again. Or just save to a different location, e.g. ram disk /dev/shm
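Waiting for free space instead of sleeping for a fixed time could be sketched as follows. This is a Python illustration, not the openQA test API; `wait_for_free_space` and its thresholds are hypothetical.

```python
import shutil
import time


def wait_for_free_space(path, min_free_bytes, timeout=120.0, poll=5.0):
    """Poll until `path` has at least `min_free_bytes` free or time runs out.

    Returns True if enough space became available, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if shutil.disk_usage(path).free >= min_free_bytes:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Unlike a fixed `sleep 120`, this returns as soon as the condition holds and reports a timeout explicitly, so the caller can fail with a specific message.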

#9 Updated by okurz about 2 years ago

  • Assignee changed from okurz to yosun

yosun, is there anything else I could help with here?

#10 Updated by yosun about 2 years ago

The suggestion in #8 is helpful, thanks! But I still haven't found time to work on it. Maybe I could work on it during the SLE12SP4 RC period.

#11 Updated by yosun almost 2 years ago

  • Status changed from Feedback to Resolved

Fixed by adding a sync before tarring the logs. I ran some tests and could not reproduce this issue. Feel free to reopen it if it happens again.
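The idea of syncing before creating the log tarball can be illustrated with the sketch below. It is a Python analogue under stated assumptions, not the actual change made in the test; `archive_logs` is a hypothetical helper.

```python
import os
import tarfile


def archive_logs(log_dir, tarball):
    """Flush pending writes, then pack the contents of log_dir into a .tar.bz2."""
    os.sync()  # ask the kernel to write dirty pages to disk first (Unix only)
    with tarfile.open(tarball, "w:bz2") as tar:
        for name in sorted(os.listdir(log_dir)):
            # Store entries relative to log_dir, like `tar -C log_dir`.
            tar.add(os.path.join(log_dir, name), arcname=name)
    return tarball
```

After a heavy write load, the sync call may itself take a while, so whatever timeout wraps it would need to allow for that.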
