action #46658
closedopenqaworker4 caching fails
Added by okurz almost 6 years ago. Updated almost 6 years ago.
0%
Description
[25/01/2019 08:27:45] <_Marcus_> coolo: oepnsuse openqa is ... broken ;)
[25/01/2019 08:28:40] <coolo> looks like it
[25/01/2019 08:29:15] <_Marcus_> the logfiles dont tell me anything, but it seems the workers dont launch at all
[25/01/2019 08:32:28] <coolo> _Marcus_: it seems that just openqaworker4 is broken. but it managed to 'test' all your jobs very quickly and made them all incomplete :)
[25/01/2019 08:38:50] <coolo> _Marcus_: I removed all caches and restarted the services
[25/01/2019 08:38:58] <coolo> if https://openqa.opensuse.org/tests/837660#live survices you can restart the rest
[25/01/2019 08:49:28] <okurz> coolo: https://openqa.opensuse.org/tests/837660/file/autoinst-log.txt is incomplete, caching synced lots of stuff, e.g. tests but then fails with "[2019-01-25T08:47:00.0296 CET] [info] result: setup failure: Failed to rsync tests: exit 0"
[25/01/2019 08:51:06] <coolo> okurz: yes. not my monkey
Updated by okurz almost 6 years ago
- Priority changed from Immediate to Urgent
I rebooted the machine to no avail. I disabled all worker instances for now with systemctl disable --now openqa-worker.target
and retriggered all incomplete jobs from today on the openqaworker4 instances looked up from https://openqa.opensuse.org/admin/workers
Updated by okurz almost 6 years ago
- Copied to action #46664: useless debug output of "GLOB(…)" from openQA worker added
Updated by okurz almost 6 years ago
- Status changed from In Progress to Feedback
This triggered tests on other machines but from looking at https://openqa.opensuse.org/tests/ it seems all x86_64 workers are bad. So maybe something in the last update from tonight, however aarch64 seems to be just fine
Trying to revert to an earlier snapshot on w1 for debugging
openqaworker1:~ # snapper ls
Type | # | Pre # | Date | User | Cleanup | Description | Userdata
-------+-----+-------+-------------------------+------+---------+-------------------------+--------------
single | 0 | | | root | | current |
single | 153 | | 2019-01-14T00:35:58 CET | root | number | Snapshot Update of #152 | important=yes
single | 154 | | 2019-01-15T01:11:08 CET | root | number | Snapshot Update of #153 | important=yes
single | 155 | | 2019-01-16T00:59:46 CET | root | number | Snapshot Update of #154 | important=yes
single | 156 | | 2019-01-17T01:28:52 CET | root | number | Snapshot Update of #155 | important=yes
single | 157 | | 2019-01-18T01:50:56 CET | root | number | Snapshot Update of #156 | important=yes
single | 158 | | 2019-01-19T01:54:50 CET | root | number | Snapshot Update of #157 | important=yes
single | 159 | | 2019-01-20T01:39:26 CET | root | number | Snapshot Update of #158 | important=yes
single | 160 | | 2019-01-21T00:30:04 CET | root | number | Snapshot Update of #159 | important=yes
single | 161 | | 2019-01-22T01:45:42 CET | root | number | Snapshot Update of #160 | important=yes
single | 162 | | 2019-01-23T01:50:49 CET | root | number | Snapshot Update of #161 | important=yes
single | 163 | | 2019-01-24T00:15:09 CET | root | | Snapshot Update of #162 |
single | 164 | | 2019-01-25T00:54:24 CET | root | | Snapshot Update of #163 |
openqaworker1:~ # transactional-update rollback 164
transactional-update 2.11 started
Options: rollback 164
Separate /var detected.
Separate /etc detected.
Rollback to snapshot 164...
Please reboot to finish rollback!
openqaworker1:~ # reboot
Connection to openqaworker1 closed by remote host.
Connection to openqaworker1 closed.
Ok, did not help. I think this was the wrong snapshot.
single | 163 | | 2019-01-24T00:15:09 CET | root | | Snapshot Update of #162 |
single | 164 | | 2019-01-25T00:54:24 CET | root | | Snapshot Update of #163 |
openqaworker1:~ # systemctl disable --now openqa-worker.target && transactional-update rollback 163 && reboot
transactional-update 2.11 started
Options: rollback 163
Separate /var detected.
Separate /etc detected.
Rollback to snapshot 163...
Please reboot to finish rollback!
Connection to openqaworker1 closed by remote host.
Connection to openqaworker1 closed.
The latest changes of openQA-worker
:
# rpm -q --changelog openQA-worker | less
* Thu Jan 24 2019 coolo@suse.com
- Update to version 4.6.1548347463.2f69b0bf:
* worker: Fix two perl warnings
* Use nicer syntax for default vars
* Add (carried over) bugref to job.done AMQP event
I suspect worker: Fix two perl warnings
I disabled the workers on all w1, w4, imagetester with systemctl disable --now openqa-worker.target
(ok, on imagetester I manually called systemctl stop openqa-worker@{1..5}
because the target does not seem to be there).
Then I retriggered all incomplete jobs on https://openqa.opensuse.org/tests?&resultfilter=Incomplete , this time manually but I think I already applied a script once. And now openqaworker1 picked them up and it seems to work fine, e.g. https://openqa.opensuse.org/tests/837782 .
In the service journal I can find again the warnings that the above commit tried to fix:
Jan 25 09:36:48 openqaworker1 worker[3143]: Use of uninitialized value in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 245.
Jan 25 09:36:48 openqaworker1 worker[3143]: [info] rsync:
Jan 25 09:36:48 openqaworker1 worker[3143]: Use of uninitialized value $exit in numeric eq (==) at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 246.
whereas on w4 that is still affected the journal says
Jan 25 08:59:57 openqaworker4 worker[2427]: Use of uninitialized value $lines[0] in join or string at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 387.
Jan 25 09:00:02 openqaworker4 worker[2427]: Use of uninitialized value $exit in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 246.
I rolled back w4 and imagester as well. They are happily executing test jobs properly now.
Regarding "How did we come to this?"¶
https://github.com/os-autoinst/openQA/pull/1965 was merged despite tests failing. One reason is certainly that many tests failed because of a new, unspecified dependency "lsof" for a depending package, e.g. see https://travis-ci.org/os-autoinst/openQA/jobs/482477148 . This probably lead mkittler to the conclusion that tests only failed because of that. However https://travis-ci.org/os-autoinst/openQA/jobs/482477148#L1136 was terminated because of a "stalled build" so could point to some other problems here.
Fix¶
https://github.com/os-autoinst/openQA/pull/1972
It's merged, the packages were built for openSUSE Leap 42.3 already and I deployed them on power8. Jobs started fine so I guess we are settled for tomorrow's automatic update.
Updated by okurz almost 6 years ago
hm, many jobs were fine, like https://openqa.opensuse.org/tests/839085 but others still incomplete like https://openqa.opensuse.org/tests/839178/file/autoinst-log.txt with
[2019-01-26T06:25:31.0098 UTC] [info] result: setup failure: Failed to rsync tests: exit
which is even on the non-automatically updated power8. Restarted workers on power8 and at least one job https://openqa.opensuse.org/tests/839191 runs fine. Have retriggered all incompletes on https://openqa.opensuse.org/tests?&resultfilter=Incomplete and so far all are running without any changes on the workers.
Seems like some still have a problem, e.g. https://openqa.opensuse.org/tests/839420/file/autoinst-log.txt showing "[2019-01-26T08:45:23.0632 CET] [info] result: setup failure: Failed to rsync tests: exit 24". What is that supposed to mean?
Some more fails even though many jobs are fine:
- https://openqa.opensuse.org/tests/839527/file/autoinst-log.txt , https://openqa.opensuse.org/tests/839528/file/autoinst-log.txt , https://openqa.opensuse.org/tests/839530/file/autoinst-log.txt with
[2019-01-26T09:58:49.0571 CET] [debug] Downloading openSUSE-Leap-15.0-DVD-x86_64.iso - request sent to Cache Service.
[2019-01-26T09:58:54.0644 CET] [debug] Download of openSUSE-Leap-15.0-DVD-x86_64.iso processed
[2019-01-26T09:58:54.0647 CET] [debug]
[2019-01-26T09:58:59.0719 CET] [warn] job is missing files, releasing job
[2019-01-26T09:58:59.0750 CET] [info] +++ worker notes +++
[2019-01-26T09:58:59.0750 CET] [info] end time: 2019-01-26 08:58:59
[2019-01-26T09:58:59.0751 CET] [info] result: setup failure: Failed to rsync tests: exit
[2019-01-26T10:06:10.0093 CET] [debug] Downloading opensuse-42.3-x86_64-GM-gnome@64bit.qcow2 - request sent to Cache Service.
[2019-01-26T10:06:15.0148 CET] [debug] Download of opensuse-42.3-x86_64-GM-gnome@64bit.qcow2 processed
[2019-01-26T10:06:15.0150 CET] [debug]
[2019-01-26T10:06:15.0162 CET] [debug] Downloading openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso - request sent to Cache Service.
[2019-01-26T10:06:20.0213 CET] [debug] Download of openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso processed
[2019-01-26T10:06:20.0218 CET] [debug] [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
[DEBUG] CACHE: Health: Real size: 52674571776, Configured limit: 53687091200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52674571776
[INFO] Downloading openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso from http://openqa1-opensuse/tests/839451/asset/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso
[INFO] CACHE: updating the /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso last usage
[DEBUG] CACHE: Content has not changed, not downloading the /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso but updating last use
[2019-01-26T10:06:25.0266 CET] [warn] job is missing files, releasing job
[2019-01-26T10:06:25.0292 CET] [info] +++ worker notes +++
[2019-01-26T10:06:25.0292 CET] [info] end time: 2019-01-26 09:06:25
[2019-01-26T10:06:25.0292 CET] [info] result: setup failure: Failed to rsync tests: exit
I do not yet see the pattern :(
EDIT: 6 more jobs failed with result: setup failure: Failed to rsync tests: exit
But I think I got it now. The above message means that the exit is undef
which was previously implicitly accepted by $exit == 0
. This happens when no test syncing is actually necessary. -> https://github.com/os-autoinst/openQA/pull/1976
Also hotpatched openqaworker1 (transactional-update shell
, edit, reboot
).
merged PR and deployed on all workers (except openqaworker1 which has the hot fix already). Awaiting auto-deployment of tomorrow.
Updated by binary_sequence almost 6 years ago
@okurz, have you tried to execute manually
_openqa-worker@openqaworker4:/var/lib/openqa/chache/openqa.opensuse.org/tests> rsync -va rsync://openqa.opensuse.org/openqa-tests .
to see what happens?
Have you check that _openqa-worker has write permission into /var/lib/openqa/chache/openqa.opensuse.org/tests
Updated by okurz almost 6 years ago
thanks for your hints but it looks like all syncing itself is fine, only I misinterpreted "undef" as a failure but it's "success", fixed with https://github.com/os-autoinst/openQA/pull/1976
All looks good now on the workers for now. There was one incomplete with "exit 24" which I retriggered but I do not think it's related to my changes because proper numerical results should be handled the same way as in before.
Updated by okurz over 4 years ago
- Related to action #69553: job incompletes with "Failed to rsync tests: exit code 10":retry, improve user feedback added