action #19780


[sle][functional][aarch64][x86_64] test fails in yast2_lan and all upcoming modules, investigate failure, check worker logs

Added by riafarov almost 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Bugs in existing tests
Start date: 2017-06-13
Due date: -
% Done: 0%
Estimated time: -
Difficulty: -

Description

Observation

Something went terribly wrong during execution; we need to check whether that was a problem with the worker or some product degradation.

openQA test in scenario sle-12-SP3-Server-DVD-aarch64-sdk+allpatterns@aarch64 fails in
yast2_lan

Reproducible

Fails since (at least) Build 0424 (current job)

Expected result

Last good: 0422 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues: 1 (0 open, 1 closed)

Has duplicate: openQA Tests - action #18010: test fails in yast2_lan as first module and then all after that after what looks like giant slowdown on worker (Rejected, 2017-03-24)

Actions #1

Updated by riafarov almost 7 years ago

  • Subject changed from [sle][functional][aarch64] test fails in yast2_lan and all upcoming modules, investigate failure, checl worker logs to [sle][functional][aarch64] test fails in yast2_lan and all upcoming modules, investigate failure, check worker logs

No info in the journal for this period of time. sar is not configured, so no better logs are available. The worker was restarted at around 1pm on 13.06.
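
As a stopgap until sar/sysstat is set up properly, something like the following throwaway sampler could be left running on the worker to record load and memory pressure around the failure window. This is purely a hypothetical helper reading /proc, not part of any openQA tooling; the output path and interval are arbitrary, and MemAvailable is assumed to be present (kernel 3.14+).

# Hypothetical load/memory sampler for a worker host (Linux /proc interface).
import time

def sample():
    with open("/proc/loadavg") as f:
        load = " ".join(f.read().split()[:3])
    with open("/proc/meminfo") as f:
        # Assumes MemAvailable is exposed by the kernel.
        mem = next(l for l in f if l.startswith("MemAvailable")).split()[1]
    return time.strftime("%F %T"), load, mem

with open("/tmp/worker-load.log", "a") as out:
    while True:
        ts, load, mem = sample()
        out.write(f"{ts} load={load} MemAvailable={mem} kB\n")
        out.flush()
        time.sleep(30)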

Actions #2

Updated by riafarov almost 7 years ago

  • Status changed from New to Feedback

Monitoring status of the issue.

Actions #3

Updated by okurz almost 7 years ago

  • Subject changed from [sle][functional][aarch64] test fails in yast2_lan and all upcoming modules, investigate failure, check worker logs to [sle][functional][aarch64][x86_64] test fails in yast2_lan and all upcoming modules, investigate failure, check worker logs
  • Priority changed from Normal to High

https://openqa.suse.de/tests/1019626#step/yast2_lan/2 and the following steps could be the same issue. One module fails to complete a script call, and then none of the following modules execute successfully. The commands do show up in the SUT, so it is not fully unresponsive. Also, https://openqa.suse.de/tests/1019626#step/curl_https/12 shows a "segmentation fault". The worker instance https://openqa.suse.de/admin/workers/369 appeared to be fine for some time apart from this one incomplete job. This could be a severe product issue, but to report anything useful we need to find more related failures or a way to reproduce it.
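
To help collect more related failures, here is a rough sketch of querying the openQA jobs API for recent failed jobs in this scenario and checking whether yast2_lan was the first failed module. The /api/v1/jobs and .../details routes exist in openQA, but the exact filter parameters and response field names used below are assumptions and may need adjusting against the live API.

# Hypothetical helper: find failed jobs in this scenario where yast2_lan failed first.
import requests

OPENQA = "https://openqa.suse.de"

# Assumed filters matching the scenario from this ticket.
params = {
    "distri": "sle",
    "version": "12-SP3",
    "flavor": "Server-DVD",
    "arch": "aarch64",
    "test": "sdk+allpatterns",
    "result": "failed",
    "limit": 50,
}
jobs = requests.get(f"{OPENQA}/api/v1/jobs", params=params).json()["jobs"]

for job in jobs:
    # Assumption: the details route exposes per-module results under "testresults".
    details = requests.get(f"{OPENQA}/api/v1/jobs/{job['id']}/details").json()["job"]
    failed = [m["name"] for m in details.get("testresults", []) if m.get("result") == "failed"]
    if failed and failed[0] == "yast2_lan":
        print(job["id"], job.get("name"), failed)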

Actions #4

Updated by SLindoMansilla over 6 years ago

Newer cases of an unresponsive system on aarch64 between snapshots:

https://openqa.suse.de/tests/1025530#step/shutdown/1

Actions #5

Updated by riafarov over 6 years ago

Similar problem on another worker: https://openqa.suse.de/tests/1054061

Actions #6

Updated by riafarov over 6 years ago

Similar issue:
https://openqa.suse.de/tests/1057968
Fishy lines in logs:
09:27:58.7276 3160 EVENT {"timestamp":{"seconds":1499847122,"microseconds":559018},"data":{"server":{"host":"0.0.0.0","auth":"none","family":"ipv4","websocket":false,"service":"5994"},"client":{"host":"127.0.0.1","service":"47128","websocket":false,"family":"ipv4"}},"event":"VNC_INITIALIZED"}
09:27:58.7278 3160 EVENT {"event":"VNC_DISCONNECTED","data":{"server":{"websocket":false,"family":"ipv4","service":"5994","auth":"none","host":"0.0.0.0"},"client":{"host":"127.0.0.1","service":"47128","family":"ipv4","websocket":false}},"timestamp":{"seconds":1499847222,"microseconds":876972}}
09:27:58.7279 3160 EVENT {"event":"VNC_CONNECTED","data":{"client":{"host":"127.0.0.1","websocket":false,"family":"ipv4","service":"47356"},"server":{"auth":"none","host":"0.0.0.0","family":"ipv4","websocket":false,"service":"5994"}},"timestamp":{"microseconds":877318,"seconds":1499847222}}
09:27:58.7280 3160 EVENT {"timestamp":{"seconds":1499847222,"microseconds":881498},"data":{"client":{"family":"ipv4","websocket":false,"service":"47356","host":"127.0.0.1"},"server":{"host":"0.0.0.0","auth":"none","service":"5994","websocket":false,"family":"ipv4"}},"event":"VNC_INITIALIZED"}
09:27:58.7282 3160 EVENT {"event":"VNC_DISCONNECTED","data":{"client":{"host":"127.0.0.1","websocket":false,"family":"ipv4","service":"47356"},"server":{"auth":"none","host":"0.0.0.0","family":"ipv4","websocket":false,"service":"5994"}},"timestamp":{"microseconds":205885,"seconds":1499847269}}

In the original issue we also have similar messages:
17:55:12.8539 23793 Loading a VM snapshot lastgood
17:55:14.6996 23796 EVENT {"event":"NIC_RX_FILTER_CHANGED","data":{"path":"/machine/peripheral-anon/device[1]/virtio-backend"},"timestamp":{"seconds":1497287396,"microseconds":552083}}
17:55:14.7000 23796 EVENT {"timestamp":{"seconds":1497288742,"microseconds":957668},"data":{"client":{"host":"127.0.0.1","service":"58154","websocket":false,"family":"ipv4"},"server":{"websocket":false,"family":"ipv4","auth":"none","service":"5991","host":"0.0.0.0"}},"event":"VNC_DISCONNECTED"}
17:55:14.7001 23796 EVENT {"event":"VNC_CONNECTED","data":{"client":{"service":"59904","host":"127.0.0.1","websocket":false,"family":"ipv4"},"server":{"websocket":false,"family":"ipv4","auth":"none","host":"0.0.0.0","service":"5991"}},"timestamp":{"seconds":1497288742,"microseconds":958379}}
17:55:14.7002 23796 EVENT {"data":{"server":{"websocket":false,"family":"ipv4","auth":"none","host":"0.0.0.0","service":"5991"},"client":{"family":"ipv4","websocket":false,"service":"59904","host":"127.0.0.1"}},"event":"VNC_INITIALIZED","timestamp":{"seconds":1497288742,"microseconds":961827}}
17:55:14.7003 23796 EVENT {"timestamp":{"microseconds":266414,"seconds":1497288789},"event":"VNC_DISCONNECTED","data":{"client":{"host":"127.0.0.1","service":"59904","family":"ipv4","websocket":false},"server":{"auth":"none","host":"0.0.0.0","service":"5991","websocket":false,"family":"ipv4"}}}
17:55:14.7005 23796 EVENT {"timestamp":{"seconds":1497288789,"microseconds":266914},"event":"VNC_CONNECTED","data":{"client":{"host":"127.0.0.1","service":"59962","websocket":false,"family":"ipv4"},"server":{"websocket":false,"family":"ipv4","auth":"none","host":"0.0.0.0","service":"5991"}}}
17:55:14.7006 23796 EVENT {"data":{"client":{"service":"59962","host":"127.0.0.1","family":"ipv4","websocket":false},"server":{"family":"ipv4","websocket":false,"host":"0.0.0.0","service":"5991","auth":"none"}},"event":"VNC_INITIALIZED","timestamp":{"microseconds":269673,"seconds":1497288789}}
17:55:14.7007 23796 EVENT {"event":"VNC_DISCONNECTED","data":{"client":{"host":"127.0.0.1","service":"59962","family":"ipv4","websocket":false},"server":{"host":"0.0.0.0","service":"5991","auth":"none","family":"ipv4","websocket":false}},"timestamp":{"seconds":1497289075,"microseconds":403984}}
17:55:14.7008 23796 EVENT {"timestamp":{"seconds":1497289075,"microseconds":404367},"event":"VNC_CONNECTED","data":{"server":{"family":"ipv4","websocket":false,"host":"0.0.0.0","service":"5991","auth":"none"},"client":{"service":"60192","host":"127.0.0.1","family":"ipv4","websocket":false}}}
17:55:14.7010 23796 EVENT {"data":{"server":{"host":"0.0.0.0","service":"5991","auth":"none","family":"ipv4","websocket":false},"client":{"host":"127.0.0.1","service":"60192","family":"ipv4","websocket":false}},"event":"VNC_INITIALIZED","timestamp":{"seconds":1497289075,"microseconds":406989}}
17:55:14.7011 23796 EVENT {"event":"RESET","timestamp":{"seconds":1497289201,"microseconds":579680}}
17:55:14.7012 23796 EVENT {"event":"VNC_DISCONNECTED","data":{"server":{"websocket":false,"family":"ipv4","auth":"none","service":"5991","host":"0.0.0.0"},"client":{"family":"ipv4","websocket":false,"host":"127.0.0.1","service":"60192"}},"timestamp":{"seconds":1497289250,"microseconds":388669}}
17:55:14.7013 23796 EVENT {"data":{"client":{"host":"127.0.0.1","service":"60384","websocket":false,"family":"ipv4"},"server":{"websocket":false,"family":"ipv4","auth":"none","host":"0.0.0.0","service":"5991"}},"event":"VNC_CONNECTED","timestamp":{"microseconds":389219,"seconds":1497289250}}
17:55:14.7014 23796 EVENT {"timestamp":{"microseconds":391948,"seconds":1497289250},"event":"VNC_INITIALIZED","data":{"server":{"auth":"none","host":"0.0.0.0","service":"5991","websocket":false,"family":"ipv4"},"client":{"family":"ipv4","websocket":false,"service":"60384","host":"127.0.0.1"}}}
17:55:14.7016 23796 EVENT {"timestamp":{"seconds":1497289255,"microseconds":465158},"event":"STOP"}
17:55:14.7017 23796 EVENT {"event":"RESUME","timestamp":{"microseconds":820956,"seconds":1497289258}}
17:55:14.7018 23796 EVENT {"event":"STOP","timestamp":{"seconds":1497289489,"microseconds":812375}}
17:55:14.7020 23796 EVENT {"timestamp":{"microseconds":676941,"seconds":1497289492},"event":"RESUME"}
17:55:14.7021 23796 EVENT {"event":"STOP","timestamp":{"seconds":1497289691,"microseconds":958030}}
17:55:14.7022 23796 EVENT {"timestamp":{"seconds":1497289695,"microseconds":19180},"event":"RESUME"}
17:55:14.7023 23796 EVENT {"event":"STOP","timestamp":{"microseconds":43114,"seconds":1497289937}}
17:55:14.7024 23796 EVENT {"timestamp":{"microseconds":167888,"seconds":1497289940},"event":"RESUME"}
17:55:14.7025 23796 EVENT {"timestamp":{"microseconds":532513,"seconds":1497289952},"event":"STOP"}
17:55:14.7026 23796 EVENT {"event":"RESUME","timestamp":{"seconds":1497289956,"microseconds":9466}}
17:55:14.7027 23796 EVENT {"event":"STOP","timestamp":{"microseconds":855517,"seconds":1497290112}}
17:55:14.7028 23796 EVENT {"event":"RESUME","timestamp":{"microseconds":698171,"seconds":1497290114}}
17:55:14.7035 23796 EVENT {"timestamp":{"seconds":1497290114,"microseconds":703403},"event":"STOP"}
17:55:14.7045 23796 EVENT {"timestamp":{"seconds":1497290114,"microseconds":704374},"event":"RESUME"}
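
A pattern in both excerpts is that many EVENT lines share the same wall-clock prefix (they were flushed to the log at once) while the embedded QEMU timestamps span minutes, i.e. the backend was not processing events in real time. Below is a small sketch for pulling those embedded timestamps out of autoinst-log.txt and flagging large gaps; the line layout is taken from the excerpts above, everything else (the 60s threshold, the file handling) is an assumption.

# Hypothetical gap finder for the "EVENT {...}" lines shown above.
import json
import re
import sys

EVENT_RE = re.compile(r"EVENT (\{.*\})\s*$")

last = None
for line in open(sys.argv[1]):  # e.g. a downloaded autoinst-log.txt
    m = EVENT_RE.search(line)
    if not m:
        continue
    event = json.loads(m.group(1))
    ts = event["timestamp"]["seconds"] + event["timestamp"]["microseconds"] / 1e6
    if last and ts - last[0] > 60:  # flag gaps longer than a minute
        print(f"{ts - last[0]:7.1f}s between {last[1]} and {event['event']}")
    last = (ts, event["event"])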

Actions #7

Updated by riafarov over 6 years ago

  • Status changed from Feedback to New
  • Assignee deleted (riafarov)

Not really possible to investigate further without extensive logging on the workers. There is nothing in the logs that could explain the system unresponsiveness. Unassigning.

Actions #8

Updated by okurz over 6 years ago

  • Target version set to Milestone 10
Actions #9

Updated by okurz over 6 years ago

  • Has duplicate action #18010: test fails in yast2_lan as first module and then all after that after what looks like giant slowdown on worker added
Actions #10

Updated by okurz over 6 years ago

The initial report was in #18010, but I closed that ticket in favor of this one, which has more comments.

Observation

openQA test in scenario sle-12-SP3-Server-DVD-x86_64-iscsi_ibft@64bit fails in
yast2_lan
and then all modules after that one. Commands don't finish, the openqaworker host can't be reached for uploading, etc.

Nothing obvious in autoinst-log.txt: the input is typed correctly, and then nothing happens when the password prompt should show up.

Journal from the corresponding worker for that job:

Mar 24 15:17:45 openqaworker3 worker[24677]: [INFO] 21948: WORKING 835537
Mar 24 15:17:47 openqaworker3 qemu-system-x86_64[21974]: looking for plugins in '/usr/lib64/sasl2', failed to op
Mar 24 17:17:45 openqaworker3 worker[24677]: [WARN] max job time exceeded, aborting 00835537-sle-12-SP3-Server-D
Mar 24 17:17:45 openqaworker3 worker[24677]: killed 21948
Mar 24 17:17:50 openqaworker3 worker[24677]: [WARN] job 835537 spent more time than MAX_JOB_TIME
Mar 24 17:17:50 openqaworker3 worker[24677]: can't open /var/lib/openqa/pool/14/testresults/result-yast2_i.json:
Mar 24 17:17:51 openqaworker3 worker[24677]: [INFO] cleaning up 00835537-sle-12-SP3-Server-DVD-x86_64-Build0295-

Then finally the job is set to incomplete because of the 2h timeout.
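
For reference, the gap between the WORKING line and the abort above is exactly two hours, i.e. the MAX_JOB_TIME the worker complains about. Below is a quick sketch for scanning a saved worker journal dump for other jobs killed this way; the matched phrases come from the excerpt above, the rest (file handling, regexes) is an assumption.

# Hypothetical scan of a saved journal (e.g. journalctl output redirected to a file)
# for jobs that were aborted because they exceeded MAX_JOB_TIME.
import re
import sys

start_re = re.compile(r"^(\w+ +\d+ [\d:]+) .* \[INFO\] \d+: WORKING (\d+)")
abort_re = re.compile(r"^(\w+ +\d+ [\d:]+) .* \[WARN\] max job time exceeded, aborting 0*(\d+)-")

started = {}
for line in open(sys.argv[1]):
    if m := start_re.match(line):
        started[m.group(2)] = m.group(1)
    elif m := abort_re.match(line):
        print(f"job {m.group(2)}: WORKING at {started.get(m.group(2), '?')}, aborted at {m.group(1)}")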

Reproducible

Fails since (at least) Build 0295 (current job)

Expected result

Last good: 0294 (or more recent)

Further details

Always latest result in this scenario: latest

Actions #11

Updated by okurz over 6 years ago

  • Target version changed from Milestone 10 to Milestone 11

Discussed in the M10 review meeting: we should take a look again once SLE15 has settled a bit.

Actions #12

Updated by okurz over 6 years ago

https://openqa.suse.de/tests/1187031# looks pretty much like the same issue, don't you think?

Actions #13

Updated by okurz over 6 years ago

  • Due date set to 2017-11-08
Actions #14

Updated by sebchlad over 6 years ago

  • Due date deleted (2017-11-08)
Actions #15

Updated by okurz over 6 years ago

  • Target version changed from Milestone 11 to Milestone 13

Not that reproducible for SLE15 currently because SLE15 did not "settle enough". I think we should take a look again later.

Actions #16

Updated by riafarov about 6 years ago

  • Target version changed from Milestone 13 to Milestone 14

Reviewers haven't seen such a failure recently; resolve in the next milestone if there is no update.

Actions #17

Updated by riafarov about 6 years ago

  • Status changed from New to Resolved

As per discussion with Santi: IIRC at that point the VM tried to restore the lastgood snapshot in curl_https since it had failed in yast2_lan, and there was a bug on aarch64 where restoring from a snapshot was breaking USB.

So resolving; reopen if reproduced.
