action #89113

[alert] PROBLEM Service Alert: openqa.suse.de/NTP Time is CRITICAL

Added by okurz 5 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2021-02-25
Due date:
% Done: 0%
Estimated time:

Description

Observation

alert notification email received:

Notification: PROBLEM
Host:         openqa.suse.de
State:        CRITICAL
Date/Time:    Thu Feb 25 12:19:12 UTC 2021
Info:         CRIT - found 5 peers, but none is suitable, this is 60 min since last successful sync, (levels at 5 min/60 min)(!!)

Service:      NTP Time

See Online: https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=NTP%20Time
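The check output above can be read as a two-level threshold on the time since the last successful sync. A minimal Python sketch of that logic, assuming the "5 min/60 min" levels are the warning and critical thresholds and that they are inclusive (the alert reported CRIT at exactly 60 min). The function name `classify_sync_age` is hypothetical and not part of the actual check:

```python
def classify_sync_age(minutes_since_sync, warn=5, crit=60):
    """Map the time since the last successful NTP sync to a monitoring state.

    Assumption: warn and crit are inclusive thresholds in minutes.
    """
    if minutes_since_sync >= crit:
        return "CRITICAL"
    if minutes_since_sync >= warn:
        return "WARNING"
    return "OK"


# The notification above reported 60 min since the last successful sync.
print(classify_sync_age(60))
```

Under these assumptions, raising `crit` is the kind of change the alert would need to stop firing on short sync gaps.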


Related issues

Related to openQA Infrastructure - action #92113: [Alerting] openqaworker-arm-3: NTP offset alert (Resolved, 2021-05-04)

History

#1 Updated by mkittler 5 months ago

I cannot access the thruk link. Note that our own NTP offset monitoring also shows this, but the alert didn't fire because the offset was below the threshold: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&editPanel=86&tab=alert&from=1614232715204&to=1614281591117

The graph also shows that the problem is no longer apparent.

So we likely don't have a real problem here. Maybe we can disable the additional monitoring in thruk?

#2 Updated by okurz 5 months ago

mkittler wrote:

I can not access the thruk link.

I think every team member should be able to access thruk.suse.de. Can you create an EngInfra ticket for that?

So we likely don't have a real problem here. Maybe we can disable the additional monitoring in thruk?

Yes, we can disable monitoring in thruk, which by the way is only an aggregation of icinga/nagios and, AFAIU, other sources. Can you also create a ticket for that?

#3 Updated by okurz 4 months ago

  • Status changed from New to Blocked
  • Assignee set to okurz

Created ticket

https://infra.nue.suse.com/SelfService/Display.html?id=187094

Hi,
the nagios/icinga alert
https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=NTP%20Time#pnp_th4/1613333628/1616098428/0
has triggered multiple times in the past weeks and returned to OK without us doing anything. Can you please bump up the alert thresholds or limits so that this alert does not trigger so quickly?

Internal tracking issue: https://progress.opensuse.org/issues/89113

Thanks and have fun,
Oliver

#5 Updated by okurz 3 months ago

  • Status changed from Blocked to Feedback

Temporarily adjusted the NTP configuration in /etc/ntp.conf as suggested by mcaj:

# server 127.127.1.0            # local clock (LCL)
# okurz: 2021-04-29: temporarily change configuration, see https://progress.opensuse.org/issues/89113
#server ntp1.suse.de
#server ntp2.suse.de
#server ntp3.suse.de
#server 0.de.pool.ntp.org
#server 1.de.pool.ntp.org
server nueo-p-infoblox.corp.suse.com
server frao-p-infoblox-01.corp.suse.com
server frao-p-infoblox-02.corp.suse.com

To prevent salt from changing the file, I changed its permissions to read-only.

#6 Updated by okurz 3 months ago

mcaj asks us to use the above configuration permanently: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/483

#7 Updated by okurz 3 months ago

The MR was merged, but today we have alerts again, see https://thruk.suse.de/thruk/cgi-bin/extinfo.cgi?type=2&host=openqa.suse.de&service=NTP%20Time#pnp_th2/1620208716/1620298716/0. Will report in the EngInfra ticket.

EDIT: Reported in the infra ticket, so it was reopened.

Checked /var/log/ntp on osd with nsinger and found these entries:

1 Apr 07:29:01 ntpd[1469]: no peer for too long, server running free now
8 Apr 17:16:01 ntpd[1469]: no peer for too long, server running free now

These coincide with alerts such as those on https://nagios-devel.suse.de/pnp4nagios/graph?host=openqa.suse.de&srv=NTP_Time&start=1619092278&end=1619097228. On startup, ntpd also reports errors like

25 Apr 03:31:10 ntpd[2168]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
25 Apr 03:31:10 ntpd[2168]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

likely because osd is a VM.
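To see how often ntpd lost all of its peers, the log can be scanned for the message quoted above. A minimal Python sketch, assuming the log format shown in this comment (day, month, time, process, message). The helper name `find_free_running` is hypothetical, and the sample lines are copied from the excerpts above:

```python
import re

# Matches lines like "1 Apr 07:29:01 ntpd[1469]: no peer for too long, ..."
LINE_RE = re.compile(
    r"^\s*(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"ntpd\[(?P<pid>\d+)\]: (?P<message>.*)$"
)


def find_free_running(lines):
    """Return (day, month, time) for each 'no peer for too long' entry."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match["message"].startswith("no peer for too long"):
            hits.append((int(match["day"]), match["month"], match["time"]))
    return hits


sample = [
    "1 Apr 07:29:01 ntpd[1469]: no peer for too long, server running free now",
    "8 Apr 17:16:01 ntpd[1469]: no peer for too long, server running free now",
    "25 Apr 03:31:10 ntpd[2168]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized",
]
for day, month, time in find_free_running(sample):
    print(day, month, time)
```

The resulting timestamps can then be compared against the alert times in the monitoring graphs.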

We should try chrony instead and see if it differs. TODO configure chrony, disable ntp, add to salt, monitor over time.

#8 Updated by okurz 2 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (okurz)

We should try chrony instead and see if it differs. TODO configure chrony, disable ntp, add to salt, monitor over time.

#9 Updated by okurz 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Using openqaworker11 to try this.

#10 Updated by okurz 2 months ago

  • Due date set to 2021-05-26
  • Status changed from In Progress to Feedback

#11 Updated by okurz 2 months ago

  • Status changed from Feedback to In Progress

MR merged. I made the equivalent changes on o3 as well. Running

sudo chronyc tracking && for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "chronyc tracking"; done

shows

Reference ID    : 904C4C6B (sv1.ggsrv.de)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:17:35 2021
System time     : 0.000153098 seconds fast of NTP time
Last offset     : +0.000017135 seconds
RMS offset      : 0.000767382 seconds
Frequency       : 26.469 ppm fast
Residual freq   : +0.002 ppm
Skew            : 0.183 ppm
Root delay      : 0.015452472 seconds
Root dispersion : 0.000430707 seconds
Update interval : 65.2 seconds
Leap status     : Normal
aarch64
Reference ID    : 55DCBEF6 (ernie.gerger-net.de)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:08:09 2021
System time     : 0.000023854 seconds slow of NTP time
Last offset     : -0.000063988 seconds
RMS offset      : 0.000169209 seconds
Frequency       : 4.631 ppm slow
Residual freq   : -0.001 ppm
Skew            : 0.019 ppm
Root delay      : 0.004383422 seconds
Root dispersion : 0.001607093 seconds
Update interval : 1028.3 seconds
Leap status     : Normal
openqaworker1
Reference ID    : 55DCBEF6 (ernie.gerger-net.de)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:16:06 2021
System time     : 0.000044388 seconds slow of NTP time
Last offset     : -0.000039339 seconds
RMS offset      : 0.000090489 seconds
Frequency       : 55.231 ppm slow
Residual freq   : -0.002 ppm
Skew            : 0.051 ppm
Root delay      : 0.004683669 seconds
Root dispersion : 0.000591569 seconds
Update interval : 1038.5 seconds
Leap status     : Normal
openqaworker4
Reference ID    : 55DCBEF6 (ernie.gerger-net.de)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:05:59 2021
System time     : 0.000025177 seconds slow of NTP time
Last offset     : -0.000053280 seconds
RMS offset      : 0.000154298 seconds
Frequency       : 70.561 ppm slow
Residual freq   : -0.001 ppm
Skew            : 0.063 ppm
Root delay      : 0.004528792 seconds
Root dispersion : 0.001632467 seconds
Update interval : 1030.2 seconds
Leap status     : Normal
openqaworker7
Reference ID    : D5EFEFA5 (ntp2.hetzner.de)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:04:51 2021
System time     : 0.000329892 seconds slow of NTP time
Last offset     : -0.000039698 seconds
RMS offset      : 0.000181923 seconds
Frequency       : 68.025 ppm slow
Residual freq   : +0.001 ppm
Skew            : 0.016 ppm
Root delay      : 0.020841513 seconds
Root dispersion : 0.011083659 seconds
Update interval : 1040.2 seconds
Leap status     : Normal
power8
506 Cannot talk to daemon
imagetester
Reference ID    : 904C9F97 (rag.9t4.net)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:16:47 2021
System time     : 0.000277323 seconds fast of NTP time
Last offset     : +0.000017476 seconds
RMS offset      : 0.000044955 seconds
Frequency       : 51.457 ppm fast
Residual freq   : -0.000 ppm
Skew            : 0.018 ppm
Root delay      : 0.012334459 seconds
Root dispersion : 0.001215961 seconds
Update interval : 65.0 seconds
Leap status     : Normal
rebel
Reference ID    : 55DCBEF6 (ernie.gerger-net.de)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:06:22 2021
System time     : 0.000012206 seconds slow of NTP time
Last offset     : +0.000002486 seconds
RMS offset      : 0.000026722 seconds
Frequency       : 46.371 ppm fast
Residual freq   : +0.000 ppm
Skew            : 0.015 ppm
Root delay      : 0.004412304 seconds
Root dispersion : 0.001652714 seconds
Update interval : 1044.4 seconds
Leap status     : Normal
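Comparing hosts by eye gets tedious with this much output. A minimal Python sketch that parses `chronyc tracking` text into a dictionary and converts the "System time" line into a signed offset. The function names `parse_tracking` and `system_offset_seconds` are hypothetical, the sign convention (positive means fast) is an assumption made here, and the sample is a few lines of the rebel output above:

```python
def parse_tracking(text):
    """Split 'Key : value' lines of chronyc tracking output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def system_offset_seconds(fields):
    """Return the system clock offset in seconds, positive when fast.

    Returns None if there is no 'System time' field, for example when
    the host answered '506 Cannot talk to daemon'.
    """
    if "System time" not in fields:
        return None
    parts = fields["System time"].split()
    # parts looks like ['0.000012206', 'seconds', 'slow', 'of', 'NTP', 'time']
    value, direction = float(parts[0]), parts[2]
    return value if direction == "fast" else -value


sample = """\
Reference ID    : 55DCBEF6 (ernie.gerger-net.de)
Stratum         : 3
Ref time (UTC)  : Wed May 19 17:06:22 2021
System time     : 0.000012206 seconds slow of NTP time
"""
print(system_offset_seconds(parse_tracking(sample)))
```

A host such as power8, where the daemon is not reachable, yields `None` instead of an offset, which makes it easy to spot in a per-host summary.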

For OSD I found that the chrony config on monitor.qa is not controlled by salt:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/492

#12 Updated by okurz 2 months ago

  • Status changed from In Progress to Blocked

The last MR is also merged. In grafana I can verify there is NTP data for both the workers and the webUI host, and I unpaused the alert. Commented in https://infra.nue.suse.com/SelfService/Display.html?id=187094, waiting for feedback in the ticket.

#13 Updated by okurz 2 months ago

  • Related to action #92113: [Alerting] openqaworker-arm-3: NTP offset alert added

#14 Updated by okurz about 2 months ago

  • Due date deleted (2021-05-26)

#15 Updated by okurz about 2 months ago

  • Status changed from Blocked to Resolved

#92113 is resolved and https://infra.nue.suse.com/SelfService/Display.html?id=187094 is resolved. The NTP alerts in the EngInfra-maintained icinga were disabled for all hosts that we monitor within the OSD infrastructure.
