action #46658

openqaworker4 caching fails

Added by okurz about 1 year ago. Updated about 1 year ago.

Status:ResolvedStart date:25/01/2019
Priority:UrgentDue date:
Assignee:okurz% Done:

0%

Category:-
Target version:-
Duration:

Description

[25/01/2019 08:27:45] <_Marcus_> coolo: oepnsuse openqa is ... broken ;)
[25/01/2019 08:28:40] <coolo> looks like it
[25/01/2019 08:29:15] <_Marcus_> the logfiles dont tell me anything, but it seems the workers dont launch at all
[25/01/2019 08:32:28] <coolo> _Marcus_: it seems that just openqaworker4 is broken. but it managed to 'test' all your jobs very quickly and made them all incomplete :)
[25/01/2019 08:38:50] <coolo> _Marcus_: I removed all caches and restarted the services
[25/01/2019 08:38:58] <coolo> if https://openqa.opensuse.org/tests/837660#live survices you can restart the rest
[25/01/2019 08:49:28] <okurz> coolo: https://openqa.opensuse.org/tests/837660/file/autoinst-log.txt is incomplete, caching synced lots of stuff, e.g. tests but then fails with "[2019-01-25T08:47:00.0296 CET] [info] result: setup failure: Failed to rsync tests: exit 0"
[25/01/2019 08:51:06] <coolo> okurz: yes. not my monkey

Related issues

Copied to openQA Project - action #46664: useless debug output of "GLOB(…)" from openQA worker Resolved 25/01/2019

History

#1 Updated by okurz about 1 year ago

  • Priority changed from Immediate to Urgent

I rebooted the machine to no avail. I disabled all worker instances for now with systemctl disable --now openqa-worker.target and retriggered all incomplete jobs from today on the openqaworker4 instances looked up from https://openqa.opensuse.org/admin/workers

#2 Updated by okurz about 1 year ago

  • Copied to action #46664: useless debug output of "GLOB(…)" from openQA worker added

#3 Updated by okurz about 1 year ago

  • Status changed from In Progress to Feedback

This triggered tests on other machines but from looking at https://openqa.opensuse.org/tests/ it seems all x86_64 workers are bad. So maybe something in the last update from tonight, however aarch64 seems to be just fine

Trying to revert to an earlier snapshot on w1 for debugging

openqaworker1:~ # snapper ls
Type   | #   | Pre # | Date                    | User | Cleanup | Description             | Userdata     
-------+-----+-------+-------------------------+------+---------+-------------------------+--------------
single | 0   |       |                         | root |         | current                 |              
single | 153 |       | 2019-01-14T00:35:58 CET | root | number  | Snapshot Update of #152 | important=yes
single | 154 |       | 2019-01-15T01:11:08 CET | root | number  | Snapshot Update of #153 | important=yes
single | 155 |       | 2019-01-16T00:59:46 CET | root | number  | Snapshot Update of #154 | important=yes
single | 156 |       | 2019-01-17T01:28:52 CET | root | number  | Snapshot Update of #155 | important=yes
single | 157 |       | 2019-01-18T01:50:56 CET | root | number  | Snapshot Update of #156 | important=yes
single | 158 |       | 2019-01-19T01:54:50 CET | root | number  | Snapshot Update of #157 | important=yes
single | 159 |       | 2019-01-20T01:39:26 CET | root | number  | Snapshot Update of #158 | important=yes
single | 160 |       | 2019-01-21T00:30:04 CET | root | number  | Snapshot Update of #159 | important=yes
single | 161 |       | 2019-01-22T01:45:42 CET | root | number  | Snapshot Update of #160 | important=yes
single | 162 |       | 2019-01-23T01:50:49 CET | root | number  | Snapshot Update of #161 | important=yes
single | 163 |       | 2019-01-24T00:15:09 CET | root |         | Snapshot Update of #162 |              
single | 164 |       | 2019-01-25T00:54:24 CET | root |         | Snapshot Update of #163 |              
openqaworker1:~ # transactional-update rollback 164
transactional-update 2.11 started
Options: rollback 164
Separate /var detected.
Separate /etc detected.
Rollback to snapshot 164...
Please reboot to finish rollback!
openqaworker1:~ # reboot
Connection to openqaworker1 closed by remote host.
Connection to openqaworker1 closed.

Ok, did not help. I think this was the wrong snapshot.

single | 163 |       | 2019-01-24T00:15:09 CET | root |         | Snapshot Update of #162 |              
single | 164 |       | 2019-01-25T00:54:24 CET | root |         | Snapshot Update of #163 |              
openqaworker1:~ # systemctl disable --now openqa-worker.target && transactional-update rollback 163 && reboot
transactional-update 2.11 started
Options: rollback 163
Separate /var detected.
Separate /etc detected.
Rollback to snapshot 163...
Please reboot to finish rollback!
Connection to openqaworker1 closed by remote host.
Connection to openqaworker1 closed.

The latest changes of openQA-worker:

# rpm -q --changelog openQA-worker | less
* Thu Jan 24 2019 coolo@suse.com
- Update to version 4.6.1548347463.2f69b0bf:
  * worker: Fix two perl warnings
  * Use nicer syntax for default vars
  * Add (carried over) bugref to job.done AMQP event

I suspect worker: Fix two perl warnings

I disabled the workers on all w1, w4, imagetester with systemctl disable --now openqa-worker.target (ok, on imagetester I manually called systemctl stop openqa-worker@{1..5} because the target does not seem to be there).

Then I retriggered all incomplete jobs on https://openqa.opensuse.org/tests?&resultfilter=Incomplete , this time manually but I think I already applied a script once. And now openqaworker1 picked them up and it seems to work fine, e.g. https://openqa.opensuse.org/tests/837782 .
In the service journal I can find again the warnings that the above commit tried to fix:

Jan 25 09:36:48 openqaworker1 worker[3143]: Use of uninitialized value in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 245.
Jan 25 09:36:48 openqaworker1 worker[3143]: [info] rsync:
Jan 25 09:36:48 openqaworker1 worker[3143]: Use of uninitialized value $exit in numeric eq (==) at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 246.

whereas on w4 that is still affected the journal says

Jan 25 08:59:57 openqaworker4 worker[2427]: Use of uninitialized value $lines[0] in join or string at /usr/share/openqa/script/../lib/OpenQA/Utils.pm line 387.
Jan 25 09:00:02 openqaworker4 worker[2427]: Use of uninitialized value $exit in concatenation (.) or string at /usr/share/openqa/script/../lib/OpenQA/Worker/Engines/isotovideo.pm line 246.

I rolled back w4 and imagester as well. They are happily executing test jobs properly now.

Regarding "How did we come to this?"

https://github.com/os-autoinst/openQA/pull/1965 was merged despite tests failing. One reason is certainly that many tests failed because of a new, unspecified dependency "lsof" for a depending package, e.g. see https://travis-ci.org/os-autoinst/openQA/jobs/482477148 . This probably lead mkittler to the conclusion that tests only failed because of that. However https://travis-ci.org/os-autoinst/openQA/jobs/482477148#L1136 was terminated because of a "stalled build" so could point to some other problems here.

Fix

https://github.com/os-autoinst/openQA/pull/1972

It's merged, the packages were built for openSUSE Leap 42.3 already and I deployed them on power8. Jobs started fine so I guess we are settled for tomorrow's automatic update.

#4 Updated by okurz about 1 year ago

hm, many jobs were fine, like https://openqa.opensuse.org/tests/839085 but others still incomplete like https://openqa.opensuse.org/tests/839178/file/autoinst-log.txt with

[2019-01-26T06:25:31.0098 UTC] [info] result: setup failure: Failed to rsync tests: exit

which is even on the non-automatically updated power8. Restarted workers on power8 and at least one job https://openqa.opensuse.org/tests/839191 runs fine. Have retriggered all incompletes on https://openqa.opensuse.org/tests?&resultfilter=Incomplete and so far all are running without any changes on the workers.

Seems like some still have a problem, e.g. https://openqa.opensuse.org/tests/839420/file/autoinst-log.txt showing "[2019-01-26T08:45:23.0632 CET] [info] result: setup failure: Failed to rsync tests: exit 24". What is that supposed to mean?

Some more fails even though many jobs are fine:

[2019-01-26T09:58:49.0571 CET] [debug] Downloading openSUSE-Leap-15.0-DVD-x86_64.iso - request sent to Cache Service.
[2019-01-26T09:58:54.0644 CET] [debug] Download of openSUSE-Leap-15.0-DVD-x86_64.iso processed
[2019-01-26T09:58:54.0647 CET] [debug] 
[2019-01-26T09:58:59.0719 CET] [warn] job is missing files, releasing job
[2019-01-26T09:58:59.0750 CET] [info] +++ worker notes +++
[2019-01-26T09:58:59.0750 CET] [info] end time: 2019-01-26 08:58:59
[2019-01-26T09:58:59.0751 CET] [info] result: setup failure: Failed to rsync tests: exit
[2019-01-26T10:06:10.0093 CET] [debug] Downloading opensuse-42.3-x86_64-GM-gnome@64bit.qcow2 - request sent to Cache Service.
[2019-01-26T10:06:15.0148 CET] [debug] Download of opensuse-42.3-x86_64-GM-gnome@64bit.qcow2 processed
[2019-01-26T10:06:15.0150 CET] [debug] 
[2019-01-26T10:06:15.0162 CET] [debug] Downloading openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso - request sent to Cache Service.
[2019-01-26T10:06:20.0213 CET] [debug] Download of openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso processed
[2019-01-26T10:06:20.0218 CET] [debug] [INFO] OpenQA::Worker::Cache: loading database from /var/lib/openqa/cache/cache.sqlite
[DEBUG] CACHE: Health: Real size: 52674571776, Configured limit: 53687091200
[INFO] OpenQA::Worker::Cache: Initialized with localhost at /var/lib/openqa/cache, current size is 52674571776
[INFO] Downloading openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso from http://openqa1-opensuse/tests/839451/asset/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso
[INFO] CACHE: updating the /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso last usage
[DEBUG] CACHE: Content has not changed, not downloading the /var/lib/openqa/cache/openqa1-opensuse/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20190124-Media.iso but updating last use

[2019-01-26T10:06:25.0266 CET] [warn] job is missing files, releasing job
[2019-01-26T10:06:25.0292 CET] [info] +++ worker notes +++
[2019-01-26T10:06:25.0292 CET] [info] end time: 2019-01-26 09:06:25
[2019-01-26T10:06:25.0292 CET] [info] result: setup failure: Failed to rsync tests: exit 

I do not yet see the pattern :(

EDIT: 6 more jobs failed with result: setup failure: Failed to rsync tests: exit

But I think I got it now. The above message means that the exit is undef which was previously implicitly accepted by $exit == 0. This happens when no test syncing is actually necessary. -> https://github.com/os-autoinst/openQA/pull/1976

Also hotpatched openqaworker1 (transactional-update shell, edit, reboot).

merged PR and deployed on all workers (except openqaworker1 which has the hot fix already). Awaiting auto-deployment of tomorrow.

#5 Updated by binary_sequence about 1 year ago

@okurz, have you tried to execute manually

_openqa-worker@openqaworker4:/var/lib/openqa/chache/openqa.opensuse.org/tests> rsync -va rsync://openqa.opensuse.org/openqa-tests .

to see what happens?

Have you check that _openqa-worker has write permission into /var/lib/openqa/chache/openqa.opensuse.org/tests

#6 Updated by okurz about 1 year ago

thanks for your hints but it looks like all syncing itself is fine, only I misinterpreted "undef" as a failure but it's "success", fixed with https://github.com/os-autoinst/openQA/pull/1976

All looks good now on the workers for now. There was one incomplete with "exit 24" which I retriggered but I do not think it's related to my changes because proper numerical results should be handled the same way as in before.

#7 Updated by okurz about 1 year ago

  • Status changed from Feedback to Resolved

Also available in: Atom PDF