action #105855
closed
[easy][beginner] openqa-worker@.service should handle NTP unavailability gracefully
Added by dheidler almost 3 years ago.
Updated over 2 years ago.
Category:
Feature requests
Description
Observation¶
-- Logs begin at Thu 2022-01-06 12:00:01 UTC, end at Wed 2022-02-02 16:18:15 UTC. --
Jan 06 12:00:37 localhost systemd[1]: Starting openQA Worker #4...
Jan 06 12:00:37 localhost systemd[1]: Started openQA Worker #4.
Jan 06 12:00:40 localhost worker[1306]: [info] worker 4:
Jan 06 12:00:40 localhost worker[1306]: - config file: /etc/openqa/workers.ini
Jan 06 12:00:40 localhost worker[1306]: - worker hostname: localhost
Jan 06 12:00:40 localhost worker[1306]: - isotovideo version: 24
Jan 06 12:00:40 localhost worker[1306]: - websocket API version: 1
Jan 06 12:00:40 localhost worker[1306]: - web UI hosts: http://10.131.0.32
Jan 06 12:00:40 localhost worker[1306]: - class: generalhw_RPi4
Jan 06 12:00:40 localhost worker[1306]: - no cleanup: no
Jan 06 12:00:40 localhost worker[1306]: - pool directory: /var/lib/openqa/pool/4
Jan 06 12:00:40 localhost worker[1306]: [info] Project dir for host http://10.131.0.32 is /var/lib/openqa/share
Jan 06 12:00:40 localhost worker[1306]: [info] Registering with openQA http://10.131.0.32
Jan 06 12:00:40 localhost worker[1306]: [error] Failed to register at http://10.131.0.32 - 403 response: timestamp mismatch - ignoring server
Jan 06 12:00:40 localhost worker[1306]: [error] Stopping because registration with all configured web UI hosts failed
Jan 06 12:00:40 localhost systemd[1]: openqa-worker@4.service: Succeeded.
This worker is running on a raspberry pi without a real time clock.
The output is from today (Feb 02) and chronyd is enabled on that machine.
Still the worker doesn't come up because the worker is started before NTP sync is done.
Acceptance criteria¶
- AC1: openQA worker does not fail with an error and does not report an error when a time source is not (yet) available
Suggestions¶
- Make the code itself resilient to time not being available yet, e.g. just wait until it becomes available
Further details¶
entrance level issue
Files
- Category set to Feature requests
- Target version set to future
I suggest to make the code itself resilient to time not being available yet, e.g. just wait until it becomes available
- Tags set to easy, beginner, entrance level
- Subject changed from openqa-worker@.service should depend on time-sync.target to [easy][beginner] openqa-worker@.service should handle NTP unavailability gracefully
- Description updated (diff)
Adding Wants=time-sync.target
to openqa-worker.target
should make this even easier. We don't need to write code to handle this as long as whatever service is being used includes time-set.target in Before= and Wants=. And if nothing implements it it'll be racy like it is now:
https://github.com/os-autoinst/openQA/pull/4692
- Related to action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in schedule added
- Due date set to 2022-06-30
- Status changed from New to Feedback
- Assignee set to livdywan
cdywan wrote:
Adding Wants=time-sync.target
to openqa-worker.target
should make this even easier. We don't need to write code to handle this as long as whatever service is being used includes time-set.target in Before= and Wants=. And if nothing implements it it'll be racy like it is now:
https://github.com/os-autoinst/openQA/pull/4692
I guess I should take the ticket and monitor this for a while since this fulfills the AC.
I did some tests using the new packages with the updated unit file, but it doesn't seem to do the trick:
openqa-piworker:~ # systemctl cat openqa-worker@3
# /usr/lib/systemd/system/openqa-worker-plain@.service
# unit description file for openQA worker processes
# start using e.g.
# systemctl start openqa-worker@1.service
# replace '1' with the instance number you want
[Unit]
Description=openQA Worker #%i
After=openqa-slirpvde.service network.target nss-lookup.target remote-fs.target time-sync.target
PartOf=openqa-worker.target
[Service]
Type=simple
PermissionsStartOnly=True
ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/%i
ExecStart=/usr/share/openqa/script/worker --instance %i
User=_openqa-worker
KillMode=mixed
Restart=on-failure
Delegate=yes
Slice=openqa-worker.slice
[Install]
WantedBy=multi-user.target
openqa-piworker:~ # systemctl status openqa-worker@3
● openqa-worker-plain@3.service - openQA Worker #3
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-plain@.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Tue 2022-04-19 12:00:41 UTC; 1 month 21 days ago
Process: 1705 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/3 (code=exited, status=0/SUCCESS)
Process: 1751 ExecStart=/usr/share/openqa/script/worker --instance 3 (code=exited, status=0/SUCCESS)
Main PID: 1751 (code=exited, status=0/SUCCESS)
Apr 19 12:00:41 openqa-piworker worker[1751]: - websocket API version: 1
Apr 19 12:00:41 openqa-piworker worker[1751]: - web UI hosts: openqa.suse.de
Apr 19 12:00:41 openqa-piworker worker[1751]: - class: generalhw_RPi3B
Apr 19 12:00:41 openqa-piworker worker[1751]: - no cleanup: no
Apr 19 12:00:41 openqa-piworker worker[1751]: - pool directory: /var/lib/openqa/pool/3
Apr 19 12:00:41 openqa-piworker worker[1751]: [info] Project dir for host openqa.suse.de is /var/lib/openqa/share
Apr 19 12:00:41 openqa-piworker worker[1751]: [info] Registering with openQA openqa.suse.de
Apr 19 12:00:41 openqa-piworker worker[1751]: [error] Failed to register at openqa.suse.de - 403 response: timestamp mismatch - ignoring server
Apr 19 12:00:41 openqa-piworker worker[1751]: [error] Stopping because registration with all configured web UI hosts failed
Apr 19 12:00:41 openqa-piworker systemd[1]: openqa-worker-plain@3.service: Succeeded.
What really seems to work(around the issue) is:
# /etc/systemd/system/openqa-worker-plain@.service.d/override.conf
[Service]
Restart=always
RestartSec=10s
Note that Restart=on-failure
won't work as openQA-worker returns 0 even in an error condition (which might make sense to be changed as well).
- Copied to action #113240: openqa-worker@.service returns 0 even when it fails to start added
Sorry about the delayed reply. I think Redmine even tried to remind me but the emails with custom queries were not very clear and I don't see future tickets in due date-queries.
dheidler wrote:
I did some tests using the new packages with the updated unit file, but it doesn't seem to do the trick:
Can you confirm that cronyd was started before?
Note that Restart=on-failure
won't work as openQA-worker returns 0 even in an error condition (which might make sense to be changed as well).
I filed #113240.
- Due date changed from 2022-06-30 to 2022-07-08
Apparently my due date update got lost
Yes - but the waiting for actual NTP sync that After=time-sync.target
should ensure, doesn't seem to work.
See the attached svg for systemd flow.
# systemd-analyze dot --order > /tmp/f.dot
Color legend: black = Requires
dark blue = Requisite
dark grey = Wants
red = Conflicts
green = After
# dot -Tsvg /tmp/f.dot > /tmp/g.svg
- Due date changed from 2022-07-08 to 2022-07-22
Marius suggests as another option we could catch the timestamp mismatch
and retry in this specific case so I will look into that.
cdywan wrote:
Marius suggests as another option we could catch the timestamp mismatch
and retry in this specific case so I will look into that.
I'm approaching this by re-using the code to retry like in other cases where registration failed but isn't a fatal error, and OPENQA_WORKER_CONNECT_INTERVAL is respected here as well: https://github.com/os-autoinst/openQA/pull/4751
Btw another potential work-around I saw while looking into that code: Authentication via personal access tokens instead of API keys is not affected by the timestamp issue.
- Status changed from Feedback to Resolved
I'd consider this resolved as both systemd-level support and explicit retry on timestamp errors have been implemented now.
Also available in: Atom
PDF