action #128942
closed[alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M
Added by okurz over 1 year ago. Updated over 1 year ago.
Description
Observation¶
alert email received 2023-05-07 03:59
see https://stats.openqa-monitor.qa.suse.de/d/GDopenqa-piworker/dashboard-for-openqa-piworker?orgId=1&viewPanel=65102&from=1683420019738&to=1683442855241
Acceptance criteria¶
- AC1: No more NTP alerts for the worker
- AC2: The clock on that worker is synchronized automatically
Suggestions¶
- Check logs of chronyd on openqa-piworker
- Maybe if there is no hardware clock, or it can't fulfill our requirements, then explicitly exclude that alert for the piworker (I think we have already excluded some other alert)
Updated by okurz over 1 year ago
- Copied to action #128945: [alert][grafana] web UI: Too many Minion job failures alert Salt (liA25iB4k) added
Updated by mkittler over 1 year ago
- Subject changed from [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) to [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by livdywan over 1 year ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
I'll take a look
Updated by livdywan over 1 year ago
It seems as though chrony is polling the server a bit too frequently and we're getting Kiss of Death (KoD), i.e. rate-limiting, messages:
sudo journalctl -u chronyd
May 16 04:05:52 openqa-piworker chronyd[1655]: Received KoD RATE from 45.131.109.21
May 16 04:45:46 openqa-piworker chronyd[1655]: Received KoD RATE from 45.131.109.2
Researching this a bit, I'm thinking we may want to configure explicit polling intervals.
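One way to make the polling intervals explicit would be chrony's `minpoll`/`maxpoll` options on a `pool`/`server` line; a hedged sketch (values illustrative, and since the defaults are already minpoll 6/maxpoll 10, raising minpoll would be the actual change):

```
# /etc/chrony.conf fragment (illustrative): poll no more often than
# every 2^8 = 256 s and back off up to 2^12 = ~68 min
pool 2.suse.pool.ntp.org iburst minpoll 8 maxpoll 12
```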
Updated by openqa_review over 1 year ago
- Due date set to 2023-05-31
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 1 year ago
cdywan wrote:
Researching this a bit, I'm thinking we may want to configure explicit polling intervals.
I also rebooted the machine just to confirm if/how that may affect the issue:
May 17 10:49:16 openqa-piworker chronyd[1651]: System clock was stepped by 6727706.753087 seconds
Maybe we really have a lot of clock skew since there's no hardware clock? I'll try to research this from a more Pi-specific angle, and maybe check how much the clock drifts in practice compared to other machines.
Updated by livdywan over 1 year ago
Apparently chronyc is quite handy for checking the last and average offset from the correct time. It seems that even now its offset is noticeably worse than on comparable machines:
> sudo salt -C '*' cmd.run 'sudo chronyc tracking | grep offset'
QA-Power8-4-kvm.qa.suse.de:
Last offset : -0.000000336 seconds
RMS offset : 0.000014526 seconds
openqa-piworker.qa.suse.de:
Last offset : -0.000340357 seconds
RMS offset : 0.000133489 seconds
grenache-1.qa.suse.de:
Last offset : +0.000017157 seconds
RMS offset : 0.000028348 seconds
Also just to confirm the salt-configured servers get no preference:
May 17 10:50:21 openqa-piworker chronyd[1651]: Selected source 10.160.0.45 (ntp2.suse.de)
May 17 11:12:58 openqa-piworker chronyd[1651]: Selected source 212.18.3.19 (2.suse.pool.ntp.org)
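Comparing these numbers across hosts by eye is error-prone; the relevant field can be pulled out of `chronyc tracking` output with awk. A minimal sketch, using a sample of the output above as stand-in input (in practice it would be piped from `chronyc tracking` directly):

```shell
# Extract the "RMS offset" value (in seconds) from chronyc tracking output.
# The sample is embedded here so the parsing can be shown standalone.
sample='Last offset     : -0.000340357 seconds
RMS offset      : 0.000133489 seconds'
rms=$(printf '%s\n' "$sample" | awk '/RMS offset/ {print $4}')
echo "$rms"   # 0.000133489
```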
Updated by livdywan over 1 year ago
Updated comparison. Looking a little better without any changes:
schort-server:
Last offset : +0.000041056 seconds
RMS offset : 0.000017617 seconds
worker12.oqa.suse.de:
Last offset : +0.000000676 seconds
RMS offset : 0.000010247 seconds
openqa-piworker.qa.suse.de:
Last offset : +0.000068636 seconds
RMS offset : 0.000052300 seconds
Updated by livdywan over 1 year ago
- Status changed from In Progress to Resolved
The alerts are no longer showing up and the issue is not reproducible for now
Updated by livdywan over 1 year ago
- Status changed from Resolved to Feedback
cdywan wrote:
The alerts are no longer showing up and the issue is not reproducible for now
Unfortunately alerts were triggered again over the weekend. Maybe the fact that the issue resolved itself could be a clue to the solution. I'll take a closer look shortly.
Updated by livdywan over 1 year ago
Numbers from today for reference:
jenkins.qa.suse.de:
Last offset : +0.000048731 seconds
RMS offset : 0.000033733 seconds
openqa-piworker.qa.suse.de:
Last offset : -0.000231949 seconds
RMS offset : 0.000248844 seconds
grenache-1.qa.suse.de:
Last offset : +0.000027556 seconds
RMS offset : 0.000040324 seconds
And we're still getting the kiss of death:
May 22 23:42:00 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
May 23 10:02:20 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
Updated by okurz over 1 year ago
cdywan wrote:
And we're still getting the kiss of death:
May 22 23:42:00 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
May 23 10:02:20 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
Of course we receive that because nobody changed anything. And as previously mentioned it's likely a red herring. sudo salt --no-color \* cmd.run 'journalctl -u chronyd | grep "Received KoD" | wc -l'
shows that we have a plethora of "Received KoD RATE" messages mentioning various different hosts. I would not worry about that.
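For what it's worth, the counting itself is easy to sanity-check offline; a sketch with the two journal lines above as sample input (counts on the real hosts will obviously be much higher):

```shell
# Count "Received KoD" lines in a journal excerpt; two sample lines
# from this ticket stand in for the real `journalctl -u chronyd` output.
log='May 22 23:42:00 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
May 23 10:02:20 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72'
count=$(printf '%s\n' "$log" | grep -c 'Received KoD')
echo "$count"   # 2
```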
Have you considered excluding that specific host+service configuration from alerting? What are your plans for this ticket?
Updated by livdywan over 1 year ago
As per the upstream chronyd docs the standard suggestion would be to use -s, so that's something we can try, i.e. by running systemctl edit chronyd and adding this:
[Service]
ExecStart=
ExecStart=/usr/sbin/chronyd -s -d
However I found little evidence of this being used on Raspberry Pi setups. So I searched again and it seems there's a custom service called fake-hwclock which transparently restores the clock on Raspbian OS. There's a relevant-looking issue on balena-os.
There's meanwhile https://github.com/kristjanvalur/soft-hwclock which we could use.
And apparently systemd-timesyncd does automatically what fake-hwclock does, and also does SNTP. It "should be fine for most use cases, no need to resort to fake-hwclock" according to the person who wrote it.
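If we tried systemd-timesyncd, the switch would roughly be `systemctl disable --now chronyd` followed by `systemctl enable --now systemd-timesyncd`, with the servers set in timesyncd.conf; a sketch (server names taken from the logs above, untested here):

```ini
# /etc/systemd/timesyncd.conf (illustrative)
[Time]
NTP=ntp2.suse.de ntp3.suse.de
FallbackNTP=2.suse.pool.ntp.org
```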
Updated by okurz over 1 year ago
all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?
Updated by livdywan over 1 year ago
okurz wrote:
all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?
I'm not sure how setting the wrong time on purpose would help.
The point I was trying to make was about saving the synched time, or rather it getting lost during shutdown, i.e. as described in the balena-os issue the clock is reset to the wrong time at shutdown. I assume we'd want either systemd-timesyncd or soft-hwclock because they deal with the shutdown case.
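For illustration, the save-at-shutdown/restore-at-boot mechanism those tools implement could be sketched as a oneshot systemd unit (unit name and paths hypothetical, not the actual fake-hwclock/soft-hwclock implementation):

```ini
# /etc/systemd/system/soft-hwclock.service (hypothetical sketch)
[Unit]
Description=Restore last saved time at boot, save it at shutdown
DefaultDependencies=no
Before=sysinit.target chronyd.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Restore: step the clock to the saved timestamp, if one exists
ExecStart=/bin/sh -c 'test -f /var/lib/soft-hwclock/ts && date -u -s "@$(cat /var/lib/soft-hwclock/ts)" || true'
# Save the current time on clean shutdown (%% escapes % in unit files)
ExecStop=/bin/sh -c 'mkdir -p /var/lib/soft-hwclock && date -u +%%s > /var/lib/soft-hwclock/ts'

[Install]
WantedBy=sysinit.target
```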
Updated by okurz over 1 year ago
cdywan wrote:
okurz wrote:
all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?
I'm not sure how setting the wrong time on purpose would help.
It might help to more easily reproduce the problem so that afterwards you can try out your ideas for potential fixes
Updated by livdywan over 1 year ago
Maybe try to set a completely wrong time and then trigger a reboot?
sudo date -s "2 OCT 2006 18:00:00"
timedatectl
Local time: Mon 2006-10-02 18:00:15 CEST
sudo journalctl -u chronyd
Oct 02 18:00:07 openqa-piworker chronyd[1650]: Backward time jump detected!
Oct 02 18:00:07 openqa-piworker chronyd[1650]: Can't synchronise: no selectable sources
Oct 02 18:03:20 openqa-piworker chronyd[1650]: Selected source 116.203.219.116 (2.suse.pool.ntp.org)
Oct 02 18:03:20 openqa-piworker chronyd[1650]: System clock wrong by 525622822.628523 seconds
sudo systemctl reboot
Oct 02 18:20:57 openqa-piworker chronyd[1650]: chronyd exiting
Oct 02 18:20:57 openqa-piworker systemd[1]: Stopping NTP client/server...
Oct 02 18:20:57 openqa-piworker systemd[1]: chronyd.service: Deactivated successfully.
Oct 02 18:20:57 openqa-piworker systemd[1]: Stopped NTP client/server.
-- Boot fcf5cbf3a5ac41d7bc926518fadd717b --
Feb 28 13:00:43 openqa-piworker systemd[1]: Starting NTP client/server...
Feb 28 13:00:43 openqa-piworker chronyd[1655]: chronyd version 4.1 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYN>
Feb 28 13:00:43 openqa-piworker chronyd[1655]: Frequency 9.831 +/- 0.490 ppm read from /var/lib/chrony/drift
Feb 28 13:00:44 openqa-piworker systemd[1]: Started NTP client/server.
Feb 28 13:00:49 openqa-piworker chronyd[1655]: Selected source 10.160.255.254 (ntp3.suse.de)
Feb 28 13:00:49 openqa-piworker chronyd[1655]: System clock wrong by 7843246.825184 seconds
May 30 08:41:36 openqa-piworker chronyd[1655]: System clock was stepped by 7843246.825184 seconds
May 30 08:42:42 openqa-piworker chronyd[1655]: Selected source 10.160.0.45 (ntp2.suse.de)
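For scale, that step of 7843246 seconds is about 90 days, which matches the clock having fallen back to the end of February before the May 30 correction; a quick conversion:

```shell
# Convert the clock error reported by chronyd into whole days.
skew=7843246              # seconds, from the log line above
days=$((skew / 86400))    # 86400 seconds per day
echo "$days days"         # 90 days
```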
Updated by livdywan over 1 year ago
- Due date changed from 2023-05-31 to 2023-06-09
- Assignee changed from livdywan to nicksinger
As per discussion Nick ordered an RTC. We'd rather not put much effort into pi-specific tweaks if we can get better results with an actual hardware clock.
Let's assume delivery takes a day or two and it can be installed next week.
Updated by nicksinger over 1 year ago
I got the delivery notification today. Given my previous experience with pollin.de, I don't expect the item to arrive before Monday.
Updated by okurz over 1 year ago
- Due date changed from 2023-06-09 to 2023-06-23
- Status changed from Feedback to Workable
ok, as you stated you have the device already. So next week you can try to put it in. Updated the due date accordingly.
Updated by okurz over 1 year ago
- Priority changed from High to Urgent
Raising prio due to alert from today
Updated by okurz over 1 year ago
Discussed in daily 2023-06-12 @nicksinger will add an alert silence and during the course of this week assemble the RTC setup.
Updated by nicksinger over 1 year ago
- Priority changed from Urgent to Normal
Added silence in https://stats.openqa-monitor.qa.suse.de/alerting/silence/d91b0090-defb-407b-a87d-a130af27d426/edit?alertmanager=grafana which reduces priority.
Updated by nicksinger over 1 year ago
- Status changed from Workable to In Progress
Going to the office now to install the RTC
Updated by nicksinger over 1 year ago
RTC mounted and currently manually configured for testing:
# cat extraconfig.txt
dtparam=i2c_arm=on
device_tree=bcm2711-rpi-4-b.dtb
dtoverlay=i2c-rtc,ds1307
Updated by nicksinger over 1 year ago
After a reboot the changes worked as expected and the RTC was detected by the system:
# dmesg | grep rtc
[ 15.793539] rtc-ds1307 1-0068: registered as rtc0
[ 15.796070] rtc-ds1307 1-0068: hctosys: unable to read the hardware clock
Had to run "hwclock -w" to write the current system time onto the RTC. After another reboot the warning regarding "unable to read" went away too:
# dmesg | grep rtc
[ 15.455293] rtc-ds1307 1-0068: registered as rtc0
[ 15.456510] rtc-ds1307 1-0068: setting system clock to 2023-06-13T13:00:01 UTC (1686661201)
So the only missing part is adding this to salt and re-enabling the alert.
Updated by nicksinger over 1 year ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/885
We currently face issues with the deployment, but since I already configured it manually I will disable our alert silence again and see if the situation improves.
Updated by nicksinger over 1 year ago
- Status changed from In Progress to Resolved
Deployment is fixed now and I re-enabled https://stats.openqa-monitor.qa.suse.de/alerting/list?view=list&search=openqa-piworker:%20NTP%20offset%20alert again. Last alert was on 2023-06-11 so it seems stable for now.