action #37782

[kernel][functional][u][medium] test fails in execute_test_run because it cannot handle broken pipes

Added by nicksinger over 1 year ago. Updated over 1 year ago.

Status: Resolved
Start date: 25/06/2018
Priority: Normal
Due date:
Assignee: yosun
% Done: 0%
Category: Bugs in existing tests
Target version: QA - future
Difficulty:
Duration:

Description

Observation

openQA test in scenario sle-12-SP4-Server-DVD-s390x-fs_stress@s390x-kvm-sle12 fails in
execute_test_run due to a broken pipe.

Suggestions to improve this test

To me this issue looks like a timeout after copying many files around. IMHO this can always happen if we rely on long-lived TCP sessions.
Without having looked into test_fs_stress-run, I'd assume it uses SSH. If so, one could try increasing the ssh timeout value (https://askubuntu.com/questions/127369/how-to-prevent-write-failed-broken-pipe-on-ssh-connection).

Another idea would be to implement retries (e.g. only fail after 3 retries).
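
A minimal sketch of the retry idea, assuming the failing step can be expressed as a single shell command. The `retry` helper and its arguments are hypothetical and not part of the existing test code:

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (last exit status $status)" >&2
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

It could be combined with the ssh keepalive options from the link above, e.g. `retry 3 10 ssh -o ServerAliveInterval=60 user@host ./run_test`, where the user, host and command are placeholders.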

Reproducible

So far it has failed only in build 0263, so it is probably sporadic.


Related issues

Related to openQA Tests - action #34012: [kernel] too generic test failure in "execute_test_run" f... Resolved 29/03/2018

History

#1 Updated by nicksinger over 1 year ago

  • Subject changed from [functional][s390x][medium] test fails in execute_test_run because it cannot handle broken pipes to [functional][medium][u] test fails in execute_test_run because it cannot handle broken pipes

#2 Updated by okurz over 1 year ago

  • Related to action #34012: [kernel] too generic test failure in "execute_test_run" for stress tests, was previously something more specific like "acceptance_fs_stress" added

#3 Updated by okurz over 1 year ago

  • Subject changed from [functional][medium][u] test fails in execute_test_run because it cannot handle broken pipes to [kernel][functional][u][medium] test fails in execute_test_run because it cannot handle broken pipes
  • Assignee set to yosun

Hi @yosun, as discussed in #34012 I assume you want to pick it up?

#4 Updated by yosun over 1 year ago

  • Assignee changed from yosun to okurz

Thanks for the info.
This failed with "packet_write_wait: Connection to 10.161.145.16 port 22: Broken pipe". I checked the serial log and it contains no oops or crash information. The failure happens while running "/usr/share/qa/tools/file_copy -j 4 -i 5 -s 5000", which means 4 parallel jobs, 5 iterations, each copying 5000 MB (5 GB) of files. Under this much filesystem stress, and with only 912 MB of RAM on this test KVM, it is easy to trigger an OOM, after which the system kills processes at random, so the ssh service or a related process may get killed.
The log only shows that the ssh connection was lost, so this is neither a kernel issue nor a test issue but a random one. The solution is to give this poor s390x KVM more RAM or disk space to avoid or reduce this kind of issue. We reported a related issue as a bug to the Lab team, but the Lab's developer didn't take it because resources are too limited.
In all, I suggest the tools team solve it by giving this KVM more resources, or that we just reject this kind of ticket.
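
One way to check the OOM theory would be to search the kernel log for OOM-killer messages. A minimal sketch, assuming the serial or kernel log is available as a plain-text file; the helper name `log_has_oom_kill` is made up for illustration, and the patterns are the usual kernel message fragments:

```shell
# log_has_oom_kill LOGFILE
# Succeeds (exit status 0) if LOGFILE contains typical OOM-killer messages.
log_has_oom_kill() {
    grep -E -i -q 'invoked oom-killer|Out of memory: Kill' "$1"
}
```

For example: `dmesg > /tmp/dmesg.txt; log_has_oom_kill /tmp/dmesg.txt && echo "OOM killer was active"`.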

#5 Updated by okurz over 1 year ago

  • Status changed from New to Feedback

yosun wrote:

We reported a related issue as a bug to the Lab team, but the Lab's developer didn't take it because resources are too limited.

What "Lab's developer" are you referring to? Do you have a ticket for that?

In all, I suggest the tools team solve it by giving this KVM more resources, or that we just reject this kind of ticket.

I don't think this is related to the tools team because when we talk about KVM we just configure the machine accordingly. We could do that but I want to wait for your response first.

#6 Updated by okurz over 1 year ago

  • Target version set to future

#7 Updated by yosun over 1 year ago

I was wrong; the failure is not caused by OOM. I checked the test code and the log at https://openqa.suse.de/tests/1777070/file/autoinst-log.txt again and found that this test fails randomly because the following line times out after 90 seconds:
assert_script_run("tar cjf $tarball -C /var/log/qa/ctcs2 ls /var/log/qa/ctcs2/");

It only fails after the fs_stress test, and this line runs right after the test finishes. I guess that after the fs stress the system needs more time to free up enough space to create the log tarball. I think the solution is to add the following lines before the failing part:
# string comparison in Perl needs "eq"; "==" would compare numerically
if (get_var("QA_TESTSUITE") eq "fs_stress") {
    sleep 120;
}

#8 Updated by okurz over 1 year ago

Rather than this long sleep, I guess we should wait for what we really need, e.g. check the free space and wait until enough is available again. Or just save to a different location, e.g. the RAM disk /dev/shm.
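
A rough sketch of waiting for free space instead of sleeping for a fixed time, assuming a POSIX `df` is available on the system under test. The function names, the threshold and the timeout are illustrative assumptions, not values taken from the test:

```shell
# free_kb PATH
# Prints the available space (in KiB) of the filesystem containing PATH.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# wait_for_free_space PATH NEEDED_KB TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls until at least NEEDED_KB KiB are available; fails once the timeout is reached.
wait_for_free_space() {
    path=$1
    needed_kb=$2
    timeout=$3
    interval=${4:-5}
    waited=0
    while [ "$(free_kb "$path")" -lt "$needed_kb" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

For example, `wait_for_free_space /var/log/qa 102400 120` would wait up to 120 seconds for 100 MiB. Unlike the fixed sleep, this returns as soon as the condition is met.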

#9 Updated by okurz over 1 year ago

  • Assignee changed from okurz to yosun

@yosun, is there anything else I could help with here?

#10 Updated by yosun over 1 year ago

The suggestion in #8 is helpful, thanks! But I still haven't found time to work on it. Maybe I can work on it during the SLE12SP4 RC period.

#11 Updated by yosun over 1 year ago

  • Status changed from Feedback to Resolved

Fixed by adding a sync before tarring the logs. I ran some tests and could not reproduce the issue. Feel free to reopen it if it happens again.
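
In shell terms, the idea of syncing before creating the tarball could look roughly like this. It is only a sketch, not the actual change made to the test; `archive_logs` and its arguments are hypothetical:

```shell
# archive_logs SRC_DIR TARBALL
# Flushes pending writes, then packs SRC_DIR into a bzip2-compressed tarball.
archive_logs() {
    src_dir=$1
    tarball=$2
    # Ask the kernel to write cached data to disk before starting tar,
    # so tar does not compete with writeback left over from the stress run.
    sync
    tar cjf "$tarball" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}
```

For example: `archive_logs /var/log/qa/ctcs2 /tmp/ctcs2-logs.tar.bz2` (the tarball path is a placeholder).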
