Project

General

Profile

Actions

action #105855

closed

[easy][beginner] openqa-worker@.service should handle NTP unavailability gracefully

Added by dheidler almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Feature requests
Target version:
QA (public, currently private due to #173521) - future
Start date:
2022-02-02
Due date:
2022-07-22
% Done:

0%

Estimated time:

Description

Observation

-- Logs begin at Thu 2022-01-06 12:00:01 UTC, end at Wed 2022-02-02 16:18:15 UTC. --
Jan 06 12:00:37 localhost systemd[1]: Starting openQA Worker #4...
Jan 06 12:00:37 localhost systemd[1]: Started openQA Worker #4.
Jan 06 12:00:40 localhost worker[1306]: [info] worker 4:
Jan 06 12:00:40 localhost worker[1306]:  - config file:           /etc/openqa/workers.ini
Jan 06 12:00:40 localhost worker[1306]:  - worker hostname:       localhost
Jan 06 12:00:40 localhost worker[1306]:  - isotovideo version:    24
Jan 06 12:00:40 localhost worker[1306]:  - websocket API version: 1
Jan 06 12:00:40 localhost worker[1306]:  - web UI hosts:          http://10.131.0.32
Jan 06 12:00:40 localhost worker[1306]:  - class:                 generalhw_RPi4
Jan 06 12:00:40 localhost worker[1306]:  - no cleanup:            no
Jan 06 12:00:40 localhost worker[1306]:  - pool directory:        /var/lib/openqa/pool/4
Jan 06 12:00:40 localhost worker[1306]: [info] Project dir for host http://10.131.0.32 is /var/lib/openqa/share
Jan 06 12:00:40 localhost worker[1306]: [info] Registering with openQA http://10.131.0.32
Jan 06 12:00:40 localhost worker[1306]: [error] Failed to register at http://10.131.0.32 - 403 response: timestamp mismatch - ignoring server
Jan 06 12:00:40 localhost worker[1306]: [error] Stopping because registration with all configured web UI hosts failed
Jan 06 12:00:40 localhost systemd[1]: openqa-worker@4.service: Succeeded.

This worker is running on a raspberry pi without a real time clock.
The output is from today (Feb 02) and chronyd is enabled on that machine.
Still the worker doesn't come up because the worker is started before NTP sync is done.

Acceptance criteria

  • AC1: openQA worker does not fail with an error and does not report an error when a time source is not (yet) available

Suggestions

  • Make the code itself resilient to time not being available yet, e.g. just wait until it becomes available

Further details

entrance level issue


Files

g.svg (906 KB) g.svg dheidler, 2022-07-07 09:08

Related issues 2 (1 open1 closed)

Related to openQA Infrastructure (public) - action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in scheduleResolvednicksinger2022-06-01

Actions
Copied to openQA Project (public) - action #113240: openqa-worker@.service returns 0 even when it fails to startNew2022-02-02

Actions
Actions #1

Updated by okurz almost 3 years ago

  • Category set to Feature requests
  • Target version set to future

I suggest to make the code itself resilient to time not being available yet, e.g. just wait until it becomes available

Actions #2

Updated by dheidler almost 3 years ago

That sounds even better.

Actions #3

Updated by okurz over 2 years ago

  • Tags set to easy, beginner, entrance level
  • Subject changed from openqa-worker@.service should depend on time-sync.target to [easy][beginner] openqa-worker@.service should handle NTP unavailability gracefully
  • Description updated (diff)
Actions #4

Updated by livdywan over 2 years ago

Adding Wants=time-sync.target to openqa-worker.target should make this even easier. We don't need to write code to handle this as long as whatever service is being used includes time-set.target in Before= and Wants=. And if nothing implements it it'll be racy like it is now:

https://github.com/os-autoinst/openQA/pull/4692

Actions #5

Updated by dheidler over 2 years ago

  • Related to action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in schedule added
Actions #6

Updated by livdywan over 2 years ago

  • Due date set to 2022-06-30
  • Status changed from New to Feedback
  • Assignee set to livdywan

cdywan wrote:

Adding Wants=time-sync.target to openqa-worker.target should make this even easier. We don't need to write code to handle this as long as whatever service is being used includes time-set.target in Before= and Wants=. And if nothing implements it it'll be racy like it is now:

https://github.com/os-autoinst/openQA/pull/4692

I guess I should take the ticket and monitor this for a while since this fulfills the AC.

Actions #7

Updated by dheidler over 2 years ago

I did some tests using the new packages with the updated unit file, but it doesn't seem to do the trick:

openqa-piworker:~ # systemctl cat openqa-worker@3
# /usr/lib/systemd/system/openqa-worker-plain@.service
# unit description file for openQA worker processes
# start using e.g.
# systemctl start openqa-worker@1.service
# replace '1' with the instance number you want
[Unit]
Description=openQA Worker #%i
After=openqa-slirpvde.service network.target nss-lookup.target remote-fs.target time-sync.target
PartOf=openqa-worker.target

[Service]
Type=simple
PermissionsStartOnly=True
ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/%i
ExecStart=/usr/share/openqa/script/worker --instance %i
User=_openqa-worker
KillMode=mixed
Restart=on-failure
Delegate=yes
Slice=openqa-worker.slice

[Install]
WantedBy=multi-user.target
openqa-piworker:~ # systemctl status openqa-worker@3
● openqa-worker-plain@3.service - openQA Worker #3
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-plain@.service; enabled; vendor preset: disabled)
     Active: inactive (dead) since Tue 2022-04-19 12:00:41 UTC; 1 month 21 days ago
    Process: 1705 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/3 (code=exited, status=0/SUCCESS)
    Process: 1751 ExecStart=/usr/share/openqa/script/worker --instance 3 (code=exited, status=0/SUCCESS)
   Main PID: 1751 (code=exited, status=0/SUCCESS)

Apr 19 12:00:41 openqa-piworker worker[1751]:  - websocket API version: 1
Apr 19 12:00:41 openqa-piworker worker[1751]:  - web UI hosts:          openqa.suse.de
Apr 19 12:00:41 openqa-piworker worker[1751]:  - class:                 generalhw_RPi3B
Apr 19 12:00:41 openqa-piworker worker[1751]:  - no cleanup:            no
Apr 19 12:00:41 openqa-piworker worker[1751]:  - pool directory:        /var/lib/openqa/pool/3
Apr 19 12:00:41 openqa-piworker worker[1751]: [info] Project dir for host openqa.suse.de is /var/lib/openqa/share
Apr 19 12:00:41 openqa-piworker worker[1751]: [info] Registering with openQA openqa.suse.de
Apr 19 12:00:41 openqa-piworker worker[1751]: [error] Failed to register at openqa.suse.de - 403 response: timestamp mismatch - ignoring server
Apr 19 12:00:41 openqa-piworker worker[1751]: [error] Stopping because registration with all configured web UI hosts failed
Apr 19 12:00:41 openqa-piworker systemd[1]: openqa-worker-plain@3.service: Succeeded.
Actions #8

Updated by dheidler over 2 years ago

What really seems to work(around the issue) is:

# /etc/systemd/system/openqa-worker-plain@.service.d/override.conf
[Service]
Restart=always
RestartSec=10s

Note that Restart=on-failure won't work as openQA-worker returns 0 even in an error condition (which might make sense to be changed as well).

Actions #9

Updated by livdywan over 2 years ago

  • Copied to action #113240: openqa-worker@.service returns 0 even when it fails to start added
Actions #10

Updated by livdywan over 2 years ago

Sorry about the delayed reply. I think Redmine even tried to remind me but the emails with custom queries were not very clear and I don't see future tickets in due date-queries.

dheidler wrote:

I did some tests using the new packages with the updated unit file, but it doesn't seem to do the trick:

Can you confirm that cronyd was started before?

Note that Restart=on-failure won't work as openQA-worker returns 0 even in an error condition (which might make sense to be changed as well).

I filed #113240.

Actions #11

Updated by livdywan over 2 years ago

  • Due date changed from 2022-06-30 to 2022-07-08

Apparently my due date update got lost

Actions #12

Updated by dheidler over 2 years ago

Yes - but the waiting for actual NTP sync that After=time-sync.target should ensure, doesn't seem to work.

See the attached svg for systemd flow.

# systemd-analyze dot --order > /tmp/f.dot
   Color legend: black     = Requires
                 dark blue = Requisite
                 dark grey = Wants
                 red       = Conflicts
                 green     = After
# dot -Tsvg /tmp/f.dot > /tmp/g.svg
Actions #13

Updated by livdywan over 2 years ago

  • Due date changed from 2022-07-08 to 2022-07-22

Marius suggests as another option we could catch the timestamp mismatch and retry in this specific case so I will look into that.

Actions #14

Updated by livdywan over 2 years ago

cdywan wrote:

Marius suggests as another option we could catch the timestamp mismatch and retry in this specific case so I will look into that.

I'm approaching this by re-using the code to retry like in other cases where registration failed but isn't a fatal error, and OPENQA_WORKER_CONNECT_INTERVAL is respected here as well: https://github.com/os-autoinst/openQA/pull/4751

Btw another potential work-around I saw while looking into that code: Authentication via personal access tokens instead of API keys is not affected by the timestamp issue.

Actions #15

Updated by livdywan over 2 years ago

  • Status changed from Feedback to Resolved

I'd consider this resolved as both systemd-level support and explicit retry on timestamp errors have been implemented now.

Actions

Also available in: Atom PDF