action #128942

closed

[alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M

Added by okurz 12 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
2023-05-08
Due date:
2023-06-23
% Done:

0%

Estimated time:

Description

Observation

alert email received 2023-05-07 03:59
see https://stats.openqa-monitor.qa.suse.de/d/GDopenqa-piworker/dashboard-for-openqa-piworker?orgId=1&viewPanel=65102&from=1683420019738&to=1683442855241

Acceptance criteria

  • AC1: No more NTP alerts for the worker
  • AC2: The clock on that worker is synchronized automatically

Suggestions

  • Check logs of chronyd on openqa-piworker
  • Maybe, if there is no hardware clock or it cannot fulfill our requirements, explicitly exclude that alert for the piworker (I think we have excluded some other alert already); see the sketch below for one possible approach
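
For reference, a minimal sketch of what such an exclusion could look like as an alert silence created via Grafana's Alertmanager-compatible HTTP API, assuming unified alerting is enabled; the "hostname" label, the token variable and the time window are illustrative assumptions, not taken from our salt states:

curl -s -X POST \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  https://stats.openqa-monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences \
  -d '{
    "matchers": [{"name": "hostname", "value": "openqa-piworker", "isRegex": false}],
    "startsAt": "2023-05-08T00:00:00Z",
    "endsAt": "2023-06-23T00:00:00Z",
    "createdBy": "tools-team",
    "comment": "poo#128942: silence the NTP offset alert while the clock handling is being fixed"
  }'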

Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #128945: [alert][grafana] web UI: Too many Minion job failures alert Salt (liA25iB4k), Resolved, livdywan, 2023-05-08 to 2023-05-24

Actions #1

Updated by okurz 12 months ago

  • Copied to action #128945: [alert][grafana] web UI: Too many Minion job failures alert Salt (liA25iB4k) added
Actions #2

Updated by mkittler 12 months ago

  • Subject changed from [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) to [alert][grafana][openqa-piworker] NTP offset alert Generic (openqa-piworker ntp_offset_alert_openqa-piworker generic) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by livdywan 11 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'll take a look

Actions #4

Updated by livdywan 11 months ago

It seems as though chrony is polling the server a bit too frequently and we're getting Kiss of Death (KoD), i.e. rate-limiting, messages:

sudo journalctl -u chronyd
May 16 04:05:52 openqa-piworker chronyd[1655]: Received KoD RATE from 45.131.109.21
May 16 04:45:46 openqa-piworker chronyd[1655]: Received KoD RATE from 45.131.109.2

Researching this a bit, I'm thinking we may want to configure explicit poll intervals.
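
One way to do that would be chrony's minpoll/maxpoll options on the server/pool lines. A minimal sketch; the drop-in path is an assumption (the openSUSE chrony.conf typically includes /etc/chrony.d/*.conf):

# /etc/chrony.d/poll-interval.conf (assumed path)
# minpoll/maxpoll are powers of two in seconds, so 6..10 means poll every 64 s up to ~17 min
pool 2.suse.pool.ntp.org iburst minpoll 6 maxpoll 10
server ntp2.suse.de iburst minpoll 6 maxpoll 10
# afterwards: sudo systemctl restart chronyd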

Actions #5

Updated by openqa_review 11 months ago

  • Due date set to 2023-05-31

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by livdywan 11 months ago

cdywan wrote:

Researching this a bit I'm thinking we may want to have explicit timings.

I also rebooted the machine just to confirm if/how that may affect the issue:

May 17 10:49:16 openqa-piworker chronyd[1651]: System clock was stepped by 6727706.753087 seconds

Maybe we really have a lot of clock skew as there's no hardware clock? I'll try to research this from a more Pi-specific angle, and maybe check how much the clock drifts in practice compared to other machines.

Actions #7

Updated by livdywan 11 months ago

Apparently chronyc is quite handy for checking the last and average offset from the correct time. Even now the piworker's offset looks noticeably worse than on the other machines:

> sudo salt -C '*' cmd.run 'sudo chronyc tracking | grep offset'
QA-Power8-4-kvm.qa.suse.de:
    Last offset     : -0.000000336 seconds
    RMS offset      : 0.000014526 seconds
openqa-piworker.qa.suse.de:
    Last offset     : -0.000340357 seconds
    RMS offset      : 0.000133489 seconds
grenache-1.qa.suse.de:
    Last offset     : +0.000017157 seconds
    RMS offset      : 0.000028348 seconds

Also, just to confirm, the salt-configured servers get no preference:

May 17 10:50:21 openqa-piworker chronyd[1651]: Selected source 10.160.0.45 (ntp2.suse.de)
May 17 11:12:58 openqa-piworker chronyd[1651]: Selected source 212.18.3.19 (2.suse.pool.ntp.org)
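
A quick way to see which sources chronyd actually selects and how they compare (plain chronyc, nothing assumed beyond the host):

sudo chronyc sources -v      # '*' marks the selected source, '+' other acceptable candidates
sudo chronyc sourcestats -v  # per-source offset/drift statistics

If we wanted the SUSE-internal servers to win the selection, chrony's "prefer" option on their server lines would be the knob to look at.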
Actions #8

Updated by livdywan 11 months ago

Updated comparison. Looking a little better without any changes:

schort-server:
    Last offset     : +0.000041056 seconds
    RMS offset      : 0.000017617 seconds
worker12.oqa.suse.de:
    Last offset     : +0.000000676 seconds
    RMS offset      : 0.000010247 seconds
openqa-piworker.qa.suse.de:
    Last offset     : +0.000068636 seconds
    RMS offset      : 0.000052300 seconds
Actions #9

Updated by livdywan 11 months ago

  • Status changed from In Progress to Resolved

The alerts are no longer showing up and the issue is not reproducible for now

Actions #10

Updated by livdywan 11 months ago

  • Status changed from Resolved to Feedback

cdywan wrote:

The alerts are no longer showing up and the issue is not reproducible for now

Unfortunately alerts were triggered again over the weekend. Maybe the fact that the issue resolved itself could be a clue to the solution. I'll take a closer look shortly.

Actions #11

Updated by livdywan 11 months ago

Numbers from today for reference:

jenkins.qa.suse.de:
    Last offset     : +0.000048731 seconds
    RMS offset      : 0.000033733 seconds
openqa-piworker.qa.suse.de:
    Last offset     : -0.000231949 seconds
    RMS offset      : 0.000248844 seconds
grenache-1.qa.suse.de:
    Last offset     : +0.000027556 seconds
    RMS offset      : 0.000040324 seconds

And we're still getting the kiss of death:

May 22 23:42:00 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
May 23 10:02:20 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
Actions #12

Updated by okurz 11 months ago

cdywan wrote:

And we're still getting the kiss of death:

May 22 23:42:00 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72
May 23 10:02:20 openqa-piworker chronyd[1651]: Received KoD RATE from 89.58.51.72

Of course we receive that because nobody changed anything. And as previously mentioned it's likely a red herring. sudo salt --no-color \* cmd.run 'journalctl -u chronyd | grep "Received KoD" | wc -l' shows that we have a plethora of "Received KoD RATE" messages mentioning various different hosts. I would not worry about that.

Have you considered excluding that specific host+service configuration from alerting? What are your plans for this ticket?

Actions #13

Updated by livdywan 11 months ago

As per the upstream chronyd docs the standard suggestion would be to use -s, so that's something we can try, i.e. via systemctl edit chronyd with this override:

[Service]
ExecStart=
ExecStart=/usr/sbin/chronyd -s -d

However I found little evidence of this being used on Raspberry Pi setups. So I searched again and it seems there's a custom service called fake-hwclock which transparently restores the clock on Raspbian OS. There's also a relevant-looking issue on balena-os.

There's meanwhile https://github.com/kristjanvalur/soft-hwclock which we could use.

And apparently, according to the person who wrote it, systemd-timesyncd "does what fake-hwclock does automatically, and also does SNTP. It should be fine for most use cases, no need to resort to fake-hwclock".
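
For reference, a rough sketch of how either option could be applied; this is plain systemd tooling, nothing project-specific assumed:

# Option A: apply the chronyd -s override shown above as a drop-in
sudo systemctl edit chronyd.service    # paste the [Service] snippet, then save
sudo systemctl daemon-reload
sudo systemctl restart chronyd
systemctl cat chronyd.service          # verify the override is picked up

# Option B: switch to systemd-timesyncd, which persists the clock to disk and
# restores it at boot (fake-hwclock-like behaviour)
sudo systemctl disable --now chronyd
sudo systemctl enable --now systemd-timesyncd
timedatectl set-ntp true
timedatectl timesync-status            # needs a reasonably recent systemd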

Actions #14

Updated by okurz 11 months ago

all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?

Actions #15

Updated by livdywan 11 months ago

okurz wrote:

all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?

I'm not sure how setting the wrong time on purpose would help.

The point I was trying to make was related to saving the synced time, or rather it getting overwritten during shutdown: as described in the balena-os issue, the clock is reset to the wrong time at shutdown. I assume we'd want either systemd-timesyncd or soft-hwclock because they deal with the shutdown case.
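
To illustrate what fake-hwclock/soft-hwclock boil down to, here is a hypothetical minimal unit (name and timestamp path are made up for illustration) that writes the current time to a file on shutdown and restores it early at boot, so chronyd never starts with a clock that is years off:

# /etc/systemd/system/soft-hwclock.service (hypothetical sketch)
[Unit]
Description=Save/restore system clock on machines without a battery-backed RTC
DefaultDependencies=no
Before=sysinit.target shutdown.target chronyd.service
Conflicts=shutdown.target

[Service]
Type=oneshot
RemainAfterExit=yes
# boot: restore the last saved timestamp; NTP will correct the remaining drift
ExecStart=/bin/sh -c '[ -f /var/lib/soft-hwclock/clock ] && date -s "@$(cat /var/lib/soft-hwclock/clock)" || true'
# shutdown: persist the current time (%% is a literal % in unit files)
ExecStop=/bin/sh -c 'mkdir -p /var/lib/soft-hwclock && date +%%s > /var/lib/soft-hwclock/clock'

[Install]
WantedBy=sysinit.target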

Actions #16

Updated by okurz 11 months ago

cdywan wrote:

okurz wrote:

all those approaches sound good but are only useful to look into after we can actually reproduce the problem more easily. Maybe try to set a completely wrong time and then trigger a reboot?

I'm not sure how setting the wrong time on purpose would help.

It might help to more easily reproduce the problem so that afterwards you can try out your ideas for potential fixes.

Actions #17

Updated by livdywan 11 months ago

Maybe try to set a completely wrong time and then trigger a reboot?

sudo date -s "2 OCT 2006 18:00:00"
timedatectl
               Local time: Mon 2006-10-02 18:00:15 CEST
sudo journalctl -u chronyd
Oct 02 18:00:07 openqa-piworker chronyd[1650]: Backward time jump detected!
Oct 02 18:00:07 openqa-piworker chronyd[1650]: Can't synchronise: no selectable sources
Oct 02 18:03:20 openqa-piworker chronyd[1650]: Selected source 116.203.219.116 (2.suse.pool.ntp.org)
Oct 02 18:03:20 openqa-piworker chronyd[1650]: System clock wrong by 525622822.628523 seconds
sudo systemctl reboot
Oct 02 18:20:57 openqa-piworker chronyd[1650]: chronyd exiting
Oct 02 18:20:57 openqa-piworker systemd[1]: Stopping NTP client/server...
Oct 02 18:20:57 openqa-piworker systemd[1]: chronyd.service: Deactivated successfully.
Oct 02 18:20:57 openqa-piworker systemd[1]: Stopped NTP client/server.
-- Boot fcf5cbf3a5ac41d7bc926518fadd717b --
Feb 28 13:00:43 openqa-piworker systemd[1]: Starting NTP client/server...
Feb 28 13:00:43 openqa-piworker chronyd[1655]: chronyd version 4.1 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYN>
Feb 28 13:00:43 openqa-piworker chronyd[1655]: Frequency 9.831 +/- 0.490 ppm read from /var/lib/chrony/drift
Feb 28 13:00:44 openqa-piworker systemd[1]: Started NTP client/server.
Feb 28 13:00:49 openqa-piworker chronyd[1655]: Selected source 10.160.255.254 (ntp3.suse.de)
Feb 28 13:00:49 openqa-piworker chronyd[1655]: System clock wrong by 7843246.825184 seconds
May 30 08:41:36 openqa-piworker chronyd[1655]: System clock was stepped by 7843246.825184 seconds
May 30 08:42:42 openqa-piworker chronyd[1655]: Selected source 10.160.0.45 (ntp2.suse.de)
Actions #18

Updated by livdywan 11 months ago

  • Due date changed from 2023-05-31 to 2023-06-09
  • Assignee changed from livdywan to nicksinger

As per discussion Nick ordered an RTC. We'd rather not put much effort into pi-specific tweaks if we can get better results with an actual hardware clock.
Let's assume delivery takes a day or two and it can be installed next week.

Actions #19

Updated by nicksinger 11 months ago

I got the delivery notification today. Given my previous experience with pollin.de, I don't expect the item to arrive before Monday.

Actions #20

Updated by okurz 11 months ago

  • Due date changed from 2023-06-09 to 2023-06-23
  • Status changed from Feedback to Workable

ok, as you stated you have the device already. So next week you can try to put it in. Updated the due date accordingly.

Actions #21

Updated by okurz 10 months ago

  • Priority changed from High to Urgent

Raising prio due to alert from today

Actions #22

Updated by okurz 10 months ago

Discussed in daily 2023-06-12: @nicksinger will add an alert silence and, during the course of this week, assemble the RTC setup.

Actions #23

Updated by nicksinger 10 months ago

  • Priority changed from Urgent to Normal
Actions #24

Updated by nicksinger 10 months ago

  • Status changed from Workable to In Progress

Going to the office now to install the RTC.

Actions #25

Updated by nicksinger 10 months ago

RTC mounted and currently manually configured for testing:

# cat extraconfig.txt
dtparam=i2c_arm=on
device_tree=bcm2711-rpi-4-b.dtb
dtoverlay=i2c-rtc,ds1307
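
For completeness, one can quickly check that the RTC module is actually visible on the I2C bus (i2cdetect comes from i2c-tools; bus 1 and address 0x68 are where the ds1307 overlay expects the chip):

sudo i2cdetect -y 1    # "UU" at 0x68 means the address is claimed by the rtc-ds1307 driver
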
Actions #26

Updated by nicksinger 10 months ago

After a reboot the changes worked as expected and the RTC was detected by the system:

# dmesg | grep rtc
[   15.793539] rtc-ds1307 1-0068: registered as rtc0
[   15.796070] rtc-ds1307 1-0068: hctosys: unable to read the hardware clock

I had to run "hwclock -w" to write the current system time onto the RTC. After another reboot the warning regarding "unable to read" went away too:

# dmesg | grep rtc
[   15.455293] rtc-ds1307 1-0068: registered as rtc0
[   15.456510] rtc-ds1307 1-0068: setting system clock to 2023-06-13T13:00:01 UTC (1686661201)

So the only missing part is adding this to salt and enabling the alert again.
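
For reference, the manual steps boil down to:

hwclock -w          # write the current, NTP-synced system time to the RTC
hwclock -r          # read it back to confirm the RTC responds
systemctl reboot
dmesg | grep rtc    # expect "setting system clock to ..." instead of the hctosys warning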

Actions #27

Updated by nicksinger 10 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/885
We currently face issues with the deployment, but since I already configured it manually I will disable our alert silence again and see if the situation improves.

Actions #28

Updated by nicksinger 10 months ago

  • Status changed from In Progress to Resolved

The deployment is fixed now and I re-enabled https://stats.openqa-monitor.qa.suse.de/alerting/list?view=list&search=openqa-piworker:%20NTP%20offset%20alert again. The last alert was on 2023-06-11, so it seems stable for now.
