Project

General

Profile

action #46658

openqaworker4 caching fails

Added by okurz about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
Start date:
2019-01-25
Due date:
% Done:

0%

Estimated time:

Description

[25/01/2019 08:27:45] <_Marcus_> coolo: oepnsuse openqa is ... broken ;)
[25/01/2019 08:28:40] <coolo> looks like it
[25/01/2019 08:29:15] <_Marcus_> the logfiles dont tell me anything, but it seems the workers dont launch at all
[25/01/2019 08:32:28] <coolo> _Marcus_: it seems that just openqaworker4 is broken. but it managed to 'test' all your jobs very quickly and made them all incomplete :)
[25/01/2019 08:38:50] <coolo> _Marcus_: I removed all caches and restarted the services
[25/01/2019 08:38:58] <coolo> if https://openqa.opensuse.org/tests/837660#live survices you can restart the rest
[25/01/2019 08:49:28] <okurz> coolo: https://openqa.opensuse.org/tests/837660/file/autoinst-log.txt is incomplete, caching synced lots of stuff, e.g. tests but then fails with "[2019-01-25T08:47:00.0296 CET] [info] result: setup failure: Failed to rsync tests: exit 0"
[25/01/2019 08:51:06] <coolo> okurz: yes. not my monkey

Related issues

Related to openQA Project - action #69553: job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedbackResolved2020-08-04

Copied to openQA Project - action #46664: useless debug output of "GLOB(…)" from openQA workerResolved2019-01-25

History

#1 Updated by okurz about 4 years ago

  • Priority changed from Immediate to Urgent

I rebooted the machine to no avail. I disabled all worker instances for now with systemctl disable --now openqa-worker.target and retriggered all incomplete jobs from today on the openqaworker4 instances looked up from https://openqa.opensuse.org/admin/workers

#2 Updated by okurz about 4 years ago

  • Copied to action #46664: useless debug output of "GLOB(…)" from openQA worker added

#3 Updated by okurz about 4 years ago

  • Status changed from In Progress to Feedback

This triggered tests on other machines but from looking at https://openqa.opensuse.org/tests/ it seems all x86_64 workers are bad. So maybe something in the last update from tonight, however aarch64 seems to be just fine

Trying to revert to an earlier snapshot on w1 for debugging

openqaworker1:~ # snapper ls
Type   | #   | Pre # | Date                    | User | Cleanup | Description             | Userdata     
-------+-----+-------+-------------------------+------+---------+-------------------------+--------------
single | 0   |       |                         | root |         | current                 |              
single | 153 |       | 2019-01-14T00:35:58 CET | root | number  | Snapshot Update of #152 | important=yes
single | 154 |       | 2019-01-15T01:11:08 CET | root | number  | Snapshot Update of #153 | important=yes
single | 155 |       | 2019-01-16T00:59:46 CET | root | number  | Snapshot Update of #154 | important=yes
single | 156 |       | 2019-01-17T01:28:52 CET | root | number  | Snapshot Update of #155 | important=yes
single | 157 |       | 2019-01-18T01:50:56 CET | root | number  | Snapshot Update of #156 | important=yes
single | 158 |       | 2019-01-19T01:54:50 CET | root | number  | Snapshot Update of #157 | important=yes
single | 159 |       | 2019-01-20T01:39:26 CET | root | number  | Snapshot Update of #158 | important=yes
single | 160 |       | 2019-01-21T00:30:04 CET | root | number  | Snapshot Update of #159 | important=yes
single | 161 |       | 2019-01-22T01:45:42 CET | root | number  | Snapshot Update of #160 | important=yes
single | 162 |       | 2019-01-23T01:50:49 CET | root | number  | Snapshot Update of #161 | important=yes
single | 163 |       | 2019-01-24T00:15:09 CET | root |         | Snapshot Update of #162 |              
single | 164 |       | 2019-01-25T00:54:24 CET | root |         | Snapshot Update of #163 |              
openqaworker1:~ # transactional-update rollback 164
transactional-update 2.11 started
Options: rollback 164
Separate /var detected.
Separate /etc detected.
Rollback to snapshot 164...
Please reboot to finish rollback!
openqaworker1:~ # reboot
Connection to openqaworker1 closed by remote host.
Connection to openqaworker1 closed.

Ok, did not help. I think this was the wrong snapshot.

single | 163 |       | 2019-01-24T00:15:09 CET | root |         | Snapshot Update of #162 |              
single | 164 |       | 2019-01-25T00:54:24 CET | root |         | Snapshot Update of #163 |              
openqaworker1:~ # systemctl disable --now openqa-worker.target && transactional-update rollback 163 && reboot
transactional-update 2.11 started
Options: rollback 163
Separate /var detected.
Separate /etc detected.
Rollback to snapshot 163...
Please reboot to finish rollback!
Connection to openqaworker1 closed by remote host.
Connection to openqaworker1 closed.

The latest changes of openQA-worker:

# rpm -q --changelog openQA-worker | less
* Thu Jan 24 2019 coolo@suse.com
- Update to version 4.6.1548347463.2f69b0bf:
  * worker: Fix two perl warnings
  * Use nicer syntax for default vars
  * Add (carried over) bugref to job.done AMQP event

I suspect worker: Fix two perl warnings

I disabled the workers on all w1, w4, imagetester with systemctl disable --now openqa-worker.target (ok, on imagetester I manually called systemctl stop openqa-worker@{1..5} because the target does not seem to be there).

Then I retriggered all incomplete jobs on https://openqa.opensuse.org/tests?&resultfilter=Incomplete , this time manually but I think I already applied a script once. And now openqaworker1 picked them up and it seems to work fine, e.g. https://openqa.opensuse.org/tests/837782 .
In the service journal I can find again the warnings that the above commit tried to fix:

Jan 25 09:36:48 openqaworker1 worker[3143]: Use of uninitialized value in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 245.
Jan 25 09:36:48 openqaworker1 worker[3143]: [info] rsync:
Jan 25 09:36:48 openqaworker1 worker[3143]: Use of uninitialized value $exit in numeric eq (==) at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 246.

whereas on w4 that is still affected the journal says

Jan 25 08:59:57 openqaworker4 worker[2427]: Use of uninitialized value $lines[0] in join or string at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 387.
Jan 25 09:00:02 openqaworker4 worker[2427]: Use of uninitialized value $exit in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 246.

I rolled back w4 and imagester as well. They are happily executing test jobs properly now.

Regarding "How did we come to this?"

https://github.com/os-autoinst/openQA/pull/1965 was merged despite tests failing. One reason is certainly that many tests failed because of a new, unspecified dependency "lsof" for a depending package, e.g. see https://travis-ci.org/os-autoinst/openQA/jobs/482477148 . This probably lead mkittler to the conclusion that tests only failed because of that. However https://travis-ci.org/os-autoinst/openQA/jobs/482477148#L1136 was terminated because of a "stalled build" so could point to some other problems here.

Fix

https://github.com/os-autoinst/openQA/pull/1972

It's merged, the packages were built for openSUSE Leap 42.3 already and I deployed them on power8. Jobs started fine so I guess we are settled for tomorrow's automatic update.

#4 Updated by okurz about 4 years ago

hm, many jobs were fine, like https://openqa.opensuse.org/tests/839085 but others still incomplete like https://openqa.opensuse.org/tests/839178/file/autoinst-log.txt with

[2019-01-26T06:25:31.0098 UTC] [info] result: setup failure: Failed to rsync tests: exit

which is even on the non-automatically updated power8. Restarted workers on power8 and at least one job https://openqa.opensuse.org/tests/839191 runs fine. Have retriggered all incompletes on https://openqa.opensuse.org/tests?&resultfilter=Incomplete and so far all are running without any changes on the workers.

Seems like some still have a problem, e.g. https://openqa.opensuse.org/tests/839420/file/autoinst-log.txt showing "[2019-01-26T08:45:23.0632 CET] [info] result: setup failure: Failed to rsync tests: exit 24". What is that supposed to mean?

Some more fails even though many jobs are fine:

[2019-01-26T09:58:49.0571 CET] [debug] Downloading openSUSE-Leap-15.0-DVD-x86_64.iso - request sent to Cache Service.
[2019-01-26T09:58:54.0644 CET] [debug] Download of openSUSE-Leap-15.0-DVD-x86_64.iso processed
[2019-01-26T09:58:54.0647 CET] [debug] 
[2019-01-26T09:58:59.0719 CET] [warn] job is missing files, releasing job
[2019-01-26T09:58:59.0750 CET] [info] +++ worker notes +++
[2019-01-26T09:58:59.0750 CET] [info] end time: 2019-01-26 08:58:59
[2019-01-26T09:58:59.0751 CET] [info] result: setup failure: Failed to rsync tests: exit
[2019-01-26T10:06:10.0093 CET] [debug] Downloading opensuse-42.3-x86_64-GM-gnome@64bit.qcow2 - request sent to Cache Service.
[2019-01-26T10:06:15.0148 CET] [debug] Download of opensuse-42.3-x86_64-GM-gnome@64bit.qcow2 processed
[2019-01-26T10:06:15.0150 CET] [debug] 
[2019-01-26T10:06:15.0162 CET] [debug] Downloading openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso - request sent to Cache Service.
[2019-01-26T10:06:20.0213 CET] [debug] Download of openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso processed
[2019-01-26T10:06:20.0218 CET] [debug] [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
[DEBUG] CACHE: Health: Real size: 52674571776, Configured limit: 53687091200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52674571776
[INFO] Downloading openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso from http://openqa1-opensuse/tests/839451/asset/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso
[INFO] CACHE: updating the /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso last usage
[DEBUG] CACHE: Content has not changed, not downloading the /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso but updating last use

[2019-01-26T10:06:25.0266 CET] [warn] job is missing files, releasing job
[2019-01-26T10:06:25.0292 CET] [info] +++ worker notes +++
[2019-01-26T10:06:25.0292 CET] [info] end time: 2019-01-26 09:06:25
[2019-01-26T10:06:25.0292 CET] [info] result: setup failure: Failed to rsync tests: exit 

I do not yet see the pattern :(

EDIT: 6 more jobs failed with result: setup failure: Failed to rsync tests: exit

But I think I got it now. The above message means that the exit is undef which was previously implicitly accepted by $exit == 0. This happens when no test syncing is actually necessary. -> https://github.com/os-autoinst/openQA/pull/1976

Also hotpatched openqaworker1 (transactional-update shell, edit, reboot).

merged PR and deployed on all workers (except openqaworker1 which has the hot fix already). Awaiting auto-deployment of tomorrow.

#5 Updated by binary_sequence about 4 years ago

okurz, have you tried to execute manually

_openqa-worker@openqaworker4:/var/lib/openqa/chache/openqa.opensuse.org/tests> rsync -va rsync://openqa.opensuse.org/openqa-tests .

to see what happens?

Have you check that _openqa-worker has write permission into /var/lib/openqa/chache/openqa.opensuse.org/tests

#6 Updated by okurz about 4 years ago

thanks for your hints but it looks like all syncing itself is fine, only I misinterpreted "undef" as a failure but it's "success", fixed with https://github.com/os-autoinst/openQA/pull/1976

All looks good now on the workers for now. There was one incomplete with "exit 24" which I retriggered but I do not think it's related to my changes because proper numerical results should be handled the same way as in before.

#7 Updated by okurz about 4 years ago

  • Status changed from Feedback to Resolved

#8 Updated by okurz over 2 years ago

  • Related to action #69553: job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback added

Also available in: Atom PDF