action #128942

closed

[alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M

Added by okurz 12 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-05-08
Due date:
2023-06-23
% Done:

0%

Estimated time:

Description

Observation

alert email received 2023-05-07 03:59
see https://stats.openqa-monitor.qa.suse.de/d/GDopenqa-piworker/dashboard-for-openqa-piworker?orgId=1&viewPanel=65102&from=1683420019738&to=1683442855241

Acceptance criteria

  • AC1: No more NTP alerts for the worker
  • AC2: The clock on that worker is synchronized automatically

Suggestions

  • Check logs of chronyd on openqa-piworker
  • Maybe, if there is no hardware clock or it cannot fulfill our requirements, explicitly exclude that alert for the piworker (I think we have excluded some other alert already); see the sketch below for one possible approach
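
For reference, a minimal sketch of what such an exclusion could look like as an alert silence created via Grafana's Alertmanager-compatible HTTP API, assuming unified alerting is enabled; the "hostname" label, the token variable and the time window are illustrative assumptions, not taken from our salt states:

curl -s -X POST \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  https://stats.openqa-monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences \
  -d '{
    "matchers": [{"name": "hostname", "value": "openqa-piworker", "isRegex": false}],
    "startsAt": "2023-05-08T00:00:00Z",
    "endsAt": "2023-06-23T00:00:00Z",
    "createdBy": "tools-team",
    "comment": "poo#128942: silence the NTP offset alert while the clock handling is being fixed"
  }'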

Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #128945: [alert][grafana] web UI: Too many Minion job failures alert Salt (liA25iB4k), Resolved, livdywan, 2023-05-08 to 2023-05-24

Actions #1

Updated by okurz 12 months ago

  • Copied to action #128945: [alert][grafana] web UI: Too many Minion job failures alert Salt (liA25iB4k) added
Actions #2

Updated by mkittler 12 months ago

  • Subject changed from [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) to [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by livdywan 11 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'll take a look

Actions #4

Updated by livdywan 11 months ago

It seems as though chrony is polling the server a bit too frequently and we're getting Kiss of Death (KoD), i.e. rate-limiting, messages:

sudo journalctl -u chronyd
May 16 04:05:52 openqa-piworker chronyd[1655]: Received KoD RATE from 45.131.109.21
May 16 04:45:46 openqa-piworker chronyd[1655]: Received KoD RATE from 45.131.109.2

Researching this a bit, I'm thinking we may want to configure explicit poll intervals.
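
One way to do that would be chrony's minpoll/maxpoll options on the server/pool lines. A minimal sketch; the drop-in path is an assumption (the openSUSE chrony.conf typically includes /etc/chrony.d/*.conf):

# /etc/chrony.d/poll-interval.conf (assumed path)
# minpoll/maxpoll are powers of two in seconds, so 6..10 means poll every 64 s up to ~17 min
pool 2.suse.pool.ntp.org iburst minpoll 6 maxpoll 10
server ntp2.suse.de iburst minpoll 6 maxpoll 10
# afterwards: sudo systemctl restart chronyd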

Actions #5

Updated by openqa_review 11 months ago

  • Due date set to 2023-05-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by livdywan 11 months ago

cdywan wrote:

Researching this a bit I'm thinking we may want to have explicit timings.

I also rebooted the machine just to confirm if/how that may affect the issue:

May 17 10:49:16 openqa-piworker chronyd[1651]: System clock was stepped by 6727706.753087 seconds

Maybe we really have a lot of clock skew as there's no hardware clock? I'll try to research this from a more Pi-specific angle, and maybe check how much the clock drifts in practice compared to other machines.

Actions #7

Updated by livdywan 11 months ago

Apparently chronyc is quite handy for checking the last and average offset from the correct time. Even now the piworker's offset looks noticeably worse than on the other machines:

> sudo salt -C '*' cmd.run 'sudo chronyc tracking | grep offset'
QA-Power8-4-kvm.qa.suse.de:
    Last offset     : -0.000000336 seconds
    RMS offset      : 0.000014526 seconds
openqa-piworker.qa.suse.de:
    Last offset     : -0.000340357 seconds
    RMS offset      : 0.000133489 seconds
grenache-1.qa.suse.de:
    Last offset     : +0.000017157 seconds
    RMS offset      : 0.000028348 seconds

Also, just to confirm, the salt-configured servers get no preference:

May 17 10:50:21 openqa-piworker chronyd[1651]: Selected source 10.160.0.45 (ntp2.suse.de)
May 17 11:12:58 openqa-piworker chronyd[1651]: Selected source 212.18.3.19 (2.suse.pool.ntp.org)
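
A quick way to see which sources chronyd actually selects and how they compare (plain chronyc, nothing assumed beyond the host):

sudo chronyc sources -v      # '*' marks the selected source, '+' other acceptable candidates
sudo chronyc sourcestats -v  # per-source offset/drift statistics

If we wanted the SUSE-internal servers to win the selection, chrony's "prefer" option on their server lines would be the knob to look at.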
Actions #8

Updated by livdywan 11 months ago

Updated comparison. Looking a little better without any changes:

schort-server:
    Last offset     : +0.000041056 seconds
    RMS offset      : 0.000017617 seconds
worker12.oqa.suse.de:
    Last offset     : +0.000000676 seconds
    RMS offset      : 0.000010247 seconds
openqa-piworker.qa.suse.de:
    Last offset     : +0.000068636 seconds
    RMS offset      : 0.000052300 seconds
Actions #9

Updated by livdywan 11 months ago

  • Status changed from In Progress to Resolved

The alerts are no longer showing up and the issue is not reproducible for now

Actions #10

Updated by livdywan 11 months ago

  • Status changed from Resolved to Feedback

cdywan wrote:

The alerts are no longer showing up and the issue is not reproducible for now

Unfortunately alerts were triggered again over the weekend. Maybe the fact that the issue resolved itself could be a clue to the solution. I'll take a closer look shortly.

Actions #11

Updated by livdywan 11 months ago

Numbers from today for reference:

jenkins.qa.suse.de:
    Last offset     : +0.000048731 seconds
    RMS offset      : 0.000033733 seconds
openqa-piworker.qa.suse.de:
    Last offset     : -0.000231949 seconds
    RMS offset      : 0.000248844 seconds
grenache-1.qa.suse.de:
    Last offset     : +0.000027556 seconds
    RMS offset      : 0.000040324 seconds

And we're still getting the kiss of death:

May 22 23:42:00 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
May 23 10:02:20 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
Actions #12

Updated by okurz 11 months ago

cdywan wrote:

And we're still getting the kiss of death:

May 22 23:42:00 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
May 23 10:02:20 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72

Of course we receive that because nobody changed anything. And as previously mentioned it's likely a red herring. sudo salt --no-color \* cmd.run 'journalctl -u chronyd | grep "Received KoD" | wc -l' shows that we have a plethora of "Received KoD RATE" messages mentioning various different hosts. I would not worry about that.

Have you considered excluding that specific host+service configuration from alerting? What are your plans for this ticket?

Actions #13

Updated by livdywan 11 months ago

As per the upstream chronyd docs the standard suggestion would be to use -s, so that's something we can try, i.e. via systemctl edit chronyd with this override:

[Service]
ExecStart=
ExecStart=/usr/sbin/chronyd -s -d

However I found little evidence of this being used on Raspberry Pi setups. So I searched again and it seems there's a custom service called fake-hwclock which transparently restores the clock on Raspbian OS. There's also a relevant-looking issue on balena-os.

There's meanwhile https://github.com/kristjanvalur/soft-hwclock which we could use.

And apparently, according to the person who wrote it, systemd-timesyncd "does what fake-hwclock does automatically, and also does SNTP. It should be fine for most use cases, no need to resort to fake-hwclock".
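
For reference, a rough sketch of how either option could be applied; this is plain systemd tooling, nothing project-specific assumed:

# Option A: apply the chronyd -s override shown above as a drop-in
sudo systemctl edit chronyd.service    # paste the [Service] snippet, then save
sudo systemctl daemon-reload
sudo systemctl restart chronyd
systemctl cat chronyd.service          # verify the override is picked up

# Option B: switch to systemd-timesyncd, which persists the clock to disk and
# restores it at boot (fake-hwclock-like behaviour)
sudo systemctl disable --now chronyd
sudo systemctl enable --now systemd-timesyncd
timedatectl set-ntp true
timedatectl timesync-status            # needs a reasonably recent systemd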

Actions #14

Updated by okurz 11 months ago

all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?

Actions #15

Updated by livdywan 11 months ago

okurz wrote:

all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?

I'm not sure how setting the wrong time on purpose would help.

The point I was trying to make was related to saving the synced time, or rather it getting overwritten during shutdown: as described in the balena-os issue, the clock is reset to the wrong time at shutdown. I assume we'd want either systemd-timesyncd or soft-hwclock because they deal with the shutdown case.
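
To illustrate what fake-hwclock/soft-hwclock boil down to, here is a hypothetical minimal unit (name and timestamp path are made up for illustration) that writes the current time to a file on shutdown and restores it early at boot, so chronyd never starts with a clock that is years off:

# /etc/systemd/system/soft-hwclock.service (hypothetical sketch)
[Unit]
Description=Save/restore system clock on machines without a battery-backed RTC
DefaultDependencies=no
Before=sysinit.target shutdown.target chronyd.service
Conflicts=shutdown.target

[Service]
Type=oneshot
RemainAfterExit=yes
# boot: restore the last saved timestamp; NTP will correct the remaining drift
ExecStart=/bin/sh -c '[ -f /var/lib/soft-hwclock/clock ] && date -s "@$(cat /var/lib/soft-hwclock/clock)" || true'
# shutdown: persist the current time (%% is a literal % in unit files)
ExecStop=/bin/sh -c 'mkdir -p /var/lib/soft-hwclock && date +%%s > /var/lib/soft-hwclock/clock'

[Install]
WantedBy=sysinit.target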

Actions #16

Updated by okurz 11 months ago

cdywan wrote:

okurz wrote:

all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?

I'm not sure how setting the wrong time on purpose would help.

It might help to more easily reproduce the problem so that afterwards you can try out your ideas for potential fixes.

Actions #17

Updated by livdywan 11 months ago

Maybe try to set a completely wrong time and then trigger a reboot?

sudo date -s "2 OCT 2006 18:00:00"
timedatectl
               Local time: Mon 2006-10-02 18:00:15 CEST
sudo journalctl -u chronyd
Oct 02 18:00:07 openqa-piworker chronyd[1650]: Backward time jump detected!
Oct 02 18:00:07 openqa-piworker chronyd[1650]: Can't synchronise: no selectable sources
Oct 02 18:03:20 openqa-piworker chronyd[1650]: Selected source 116.203.219.116 (2.suse.pool.ntp.org)
Oct 02 18:03:20 openqa-piworker chronyd[1650]: System clock wrong by 525622822.628523 seconds
sudo systemctl reboot
Oct 02 18:20:57 openqa-piworker chronyd[1650]: chronyd exiting
Oct 02 18:20:57 openqa-piworker systemd[1]: Stopping NTP client/server...
Oct 02 18:20:57 openqa-piworker systemd[1]: chronyd.service: Deactivated successfully.
Oct 02 18:20:57 openqa-piworker systemd[1]: Stopped NTP client/server.
-- Boot fcf5cbf3a5ac41d7bc926518fadd717b --
Feb 28 13:00:43 openqa-piworker systemd[1]: Starting NTP client/server...
Feb 28 13:00:43 openqa-piworker chronyd[1655]: chronyd version 4.1 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYN>
Feb 28 13:00:43 openqa-piworker chronyd[1655]: Frequency 9.831 +/- 0.490 ppm read from /var/lib/chrony/drift
Feb 28 13:00:44 openqa-piworker systemd[1]: Started NTP client/server.
Feb 28 13:00:49 openqa-piworker chronyd[1655]: Selected source 10.160.255.254 (ntp3.suse.de)
Feb 28 13:00:49 openqa-piworker chronyd[1655]: System clock wrong by 7843246.825184 seconds
May 30 08:41:36 openqa-piworker chronyd[1655]: System clock was stepped by 7843246.825184 seconds
May 30 08:42:42 openqa-piworker chronyd[1655]: Selected source 10.160.0.45 (ntp2.suse.de)
Actions #18

Updated by livdywan 11 months ago

  • Due date changed from 2023-05-31 to 2023-06-09
  • Assignee changed from livdywan to nicksinger

As per discussion Nick ordered an RTC. We'd rather not put much effort into pi-specific tweaks if we can get better results with an actual hardware clock.
Let's assume delivery takes a day or two and it can be installed next week.

Actions #19

Updated by nicksinger 11 months ago

I got the delivery notification today. Given my previous experience with pollin.de, I don't expect the item to arrive before Monday.

Actions #20

Updated by okurz 11 months ago

  • Due date changed from 2023-06-09 to 2023-06-23
  • Status changed from Feedback to Workable

ok, as you stated you have the device already. So next week you can try to put it in. Updated the due date accordingly.

Actions #21

Updated by okurz 10 months ago

  • Priority changed from High to Urgent

Raising prio due to alert from today

Actions #22

Updated by okurz 10 months ago

Discussed in daily 2023-06-12: @nicksinger will add an alert silence and, during the course of this week, assemble the RTC setup.

Actions #23

Updated by nicksinger 10 months ago

  • Priority changed from Urgent to Normal
Actions #24

Updated by nicksinger 10 months ago

  • Status changed from Workable to In Progress

Going to the office now to install the RTC.

Actions #25

Updated by nicksinger 10 months ago

RTC mounted and currently manually configured for testing:

# cat extraconfig.txt
dtparam=i2c_arm=on
device_tree=bcm2711-rpi-4-b.dtb
dtoverlay=i2c-rtc,ds1307
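
For completeness, one can quickly check that the RTC module is actually visible on the I2C bus (i2cdetect comes from i2c-tools; bus 1 and address 0x68 are where the ds1307 overlay expects the chip):

sudo i2cdetect -y 1    # "UU" at 0x68 means the address is claimed by the rtc-ds1307 driver
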
Actions #26

Updated by nicksinger 10 months ago

After a reboot the changes worked as expected and the RTC was detected by the system:

# dmesg | grep rtc
[   15.793539] rtc-ds1307 1-0068: registered as rtc0
[   15.796070] rtc-ds1307 1-0068: hctosys: unable to read the hardware clock

I had to run "hwclock -w" to write the current system time onto the RTC. After another reboot the warning regarding "unable to read" went away too:

# dmesg | grep rtc
[   15.455293] rtc-ds1307 1-0068: registered as rtc0
[   15.456510] rtc-ds1307 1-0068: setting system clock to 2023-06-13T13:00:01 UTC (1686661201)

So the only missing part is adding this to salt and enabling the alert again.
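
For reference, the manual steps boil down to:

hwclock -w          # write the current, NTP-synced system time to the RTC
hwclock -r          # read it back to confirm the RTC responds
systemctl reboot
dmesg | grep rtc    # expect "setting system clock to ..." instead of the hctosys warning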

Actions #27

Updated by nicksinger 10 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/885
We currently face issues with the deployment, but since I already configured it manually I will disable our alert silence again and see if the situation improves.

Actions #28

Updated by nicksinger 10 months ago

  • Status changed from In Progress to Resolved

The deployment is fixed now and I re-enabled https://stats.openqa-monitor.qa.suse.de/alerting/list?view=list&search=openqa-piworker:%20NTP%20offset%20alert again. The last alert was on 2023-06-11, so it seems stable for now.
