action #162038
closedNo HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S
0%
Description
Observation¶
There were many logs for check whether clocks on the local host and the web UI host are in sync
.
Openqa seems unavailable for ~10min
Jun 10 00:43:01 openqa systemd[1]: Started Session c46901 of User geekotest.
Jun 10 00:43:01 openqa cron[13285]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3522 from worker_status (free)
Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3464 from worker_status (free)
Jun 10 00:43:02 openqa openqa[32518]: [debug] Rejecting authentication for
user "openqaworker4" with ip "10.100.96.76", valid key "...",
secret "...", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
Jun 10 00:43:02 openqa openqa[4972]: [debug] Rejecting authentication for user "openqaworker4" with ip "**", valid key "**", secret "**", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
Jun 10 00:43:02 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 1952 from worker_status (working)
Jun 10 00:43:02 openqa openqa[745]: [debug] Rejecting authentication for user
...
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] Enqueued restarting openQA job 14563096 via Minion job 11664862
Jun 10 00:43:04 openqa openqa[1637]: [debug] [pid:1637] Enqueued restarting openQA job 14561971 via Minion job 11664863
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] _carry_over_candidate(14563096): _failure_reason=cleanup_before_shutdown:none,force_scheduled_tasks:none,grub_set_bootargs:none,hostname:none,shutdown:none
Jun 10 00:43:04 openqa openqa[745]: [debug] [pid:745] Enqueued restarting openQA job 14562006 via Minion job 11664864
Jun 10 00:43:04 openqa openqa[32535]: [debug] [pid:32535] _carry_over_candidate(14560798): _failure_reason=consoletest_finish:none,mariadb_srv:none,mtab:none,openvswitch:none,orphaned_packages_check:none,rsync:none,ssh_cleanup:none,sshd$
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557]
_carry_over_candidate(14563096): checking take over from 14558523:
_failure_reason=GOOD
chrony status looks ok on OSD.
I saw also a python3 update and then issues with telegraf but I dont think they are related with OSD.
Steps to reproduce¶
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#162038
Suggestions¶
Consider outputting the actual "mismatch" value (the time delta exceeding some threshold) in the log message if possible and not hidden in a hashsum (https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Shared/Controller/Auth.pm#L162)DONE: https://github.com/os-autoinst/openQA/pull/5782/files#diff-9c255afcba1b2076d54bd53576e2eb5a9f9f5ef8e92e25b899b0bc34b1fad738R103- Find out/fix why some jobs fail without a specific error besides "api failure" besides showing timestamp errors in logs
Crosscheck with #163592 if the symptoms actually still appearDONE: They do and both symptoms are strongly linked together- Try to simulate slow response times (by putting a sleep at the right place) to reproduce the issue (How slow can we go before "timestamp mismatch" becomes an issue?)
- If the real cause for the error message is actually not a time difference but too slow responses by the web UI extend the error message accordingly
Out of scope¶
Updated by nicksinger 6 months ago
That doesn't seem to be the first time this happens. I just dumped the whole journal (journalctl | grep "Rejecting authentication"
) and looked trough the messages with cat /tmp/failed_auth_all | cut -d " " -f-3 | cut -d ":" -f-2 | sort | uniq -c
:
249 Jun 03 00:49
360 Jun 03 00:55
1 Jun 03 01:02
351 Jun 10 00:43
408 Jun 10 00:49
70 Jun 10 10:27
The in the description mentioned worker, openqaworker4 is misleading because the account is actually called that way. The IP shows that this is in reality openqaworker18 and its clock looks good to me in that time frame: https://monitor.qa.suse.de/d/WDopenqaworker18/worker-dashboard-openqaworker18?orgId=1&from=1717972153700&to=1717979051591&viewPanel=65102
Unfortunately we don't have a similar panel for OSD itself. I think we had this issue in the past already and we just didn't notice. For sure room for improvement but I doubt it explains why OSD was unreachable for ~10 minutes.
These are the hosts which fail to connect:
openqa:~ # cat /tmp/failed_auth_all | cut -d \" -f 4 | sort | uniq | xargs -I{} dig -x {} +short
openqaworker14.qa.suse.cz.
openqaworker16.qa.suse.cz.
openqaworker17.qa.suse.cz.
openqaworker18.qa.suse.cz.
openqa.oqa.prg2.suse.org.
worker-arm2.oqa.prg2.suse.org.
worker-arm1.oqa.prg2.suse.org.
worker32.oqa.prg2.suse.org.
worker30.oqa.prg2.suse.org.
worker34.oqa.prg2.suse.org.
worker33.oqa.prg2.suse.org.
worker31.oqa.prg2.suse.org.
worker29.oqa.prg2.suse.org.
worker35.oqa.prg2.suse.org.
worker40.oqa.prg2.suse.org.
openqa-service.qe.prg2.suse.org.
Updated by nicksinger 6 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by nicksinger 6 months ago
- Subject changed from No HTTP Response on OSD on 10-06-2024 - "timestamp mismatch" to No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry
- Priority changed from Immediate to Urgent
Used dry_run=1 host=openqa.suse.de result="result='incomplete'" additional_filters="reason like '%mismatch%'" comment="label:poo162038" ./openqa-advanced-retrigger-jobs
and restarted 17 tests:
{"result":[{"14560669":14567718}],"test_url":[{"14560669":"\/tests\/14567718"}]}
{"result":[{"14560744":14567719}],"test_url":[{"14560744":"\/tests\/14567719"}]}
{"result":[{"14560727":14567720}],"test_url":[{"14560727":"\/tests\/14567720"}]}
{"result":[{"14563663":14567721,"14563664":14567722}],"test_url":[{"14563663":"\/tests\/14567721","14563664":"\/tests\/14567722"}]}
{"result":[{"14563629":14567723}],"test_url":[{"14563629":"\/tests\/14567723"}]}
{"result":[{"14563623":14567724,"14563624":14567725,"14563625":14567726,"14563626":14567727}],"test_url":[{"14563623":"\/tests\/14567724","14563624":"\/tests\/14567725","14563625":"\/tests\/14567726","14563626":"\/tests\/14567727"}]}
{"result":[{"14563641":14567728,"14563642":14567729,"14563643":14567730,"14563644":14567731}],"test_url":[{"14563641":"\/tests\/14567728","14563642":"\/tests\/14567729","14563643":"\/tests\/14567730","14563644":"\/tests\/14567731"}]}
{"result":[{"14563665":14567732}],"test_url":[{"14563665":"\/tests\/14567732"}]}
{"result":[{"14563666":14567733}],"test_url":[{"14563666":"\/tests\/14567733"}]}
{"result":[{"14563672":14567734}],"test_url":[{"14563672":"\/tests\/14567734"}]}
{"result":[{"14560652":14567735}],"test_url":[{"14560652":"\/tests\/14567735"}]}
{"result":[{"14563484":14567736}],"test_url":[{"14563484":"\/tests\/14567736"}]}
{"result":[{"14563528":14567737}],"test_url":[{"14563528":"\/tests\/14567737"}]}
{"result":[{"14560837":14567738}],"test_url":[{"14560837":"\/tests\/14567738"}]}
{"result":[{"14560756":14567739}],"test_url":[{"14560756":"\/tests\/14567739"}]}
{"result":[{"14563529":14567740}],"test_url":[{"14563529":"\/tests\/14567740"}]}
{"enforceable":1,"errors":["Job 14561378 misses the following mandatory assets: hdd\/publiccloud_5.5_GCE-BYOS-Updates-x86_64.qcow2\nYou may try to retrigger the parent job that should create the assets and will implicitly retrigger this job as well."],"result":[],"test_url":[]}
also added an auto_review expression to the title to mitigate further issues
Updated by openqa_review 6 months ago
- Due date set to 2024-06-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 6 months ago
- Status changed from In Progress to Resolved
I checked our DB again with sudo -u geekotest psql --no-align --tuples-only --command="select id from jobs where (result='incomplete' and t_finished >= '2024-06-11' and reason like '%mismatch%');" openqa
and can't find a single job failing. From an infra POV we currently can't come up with a solution or workaround. We think that the deployment in the middle of the day (Jun 10 10:27) caused people to find this issue more easily then before.
As I don't plan to change or improve anything here (except making deployment more stable in other ticket), I will close this now. When this becomes a bigger problem we might have to think about bigger architecture changes (e.g. get rid of the strict time dependency, make sure the updates don't slow down everything so clocks get out of sync, etc.)
Updated by okurz 5 months ago
- Related to action #162644: Job fails with api failure: 403 response: timestamp mismatch after brief osd outage added
Updated by okurz 5 months ago · Edited
- Due date deleted (
2024-06-26) - Status changed from Resolved to New
- Assignee deleted (
nicksinger)
Happened again. Merging #162644 with this one as there is more information here:
openQA test in scenario sle-15-SP5-Online-QR-x86_64-create_hdd_minimal_base+sdk@64bit fails in
welcome
The test failed with this reason: api failure: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
There were also reports of osd being inaccessible in https://suse.slack.com/archives/C02CANHLANP/p1718891972135439 around that time.
There was a ton of Updating seen of worker
and influxdb/telegraf timing out sudo journalctl -S '2024-06-20 15:50:00' -U '2024-06-20 16:00:00'
The outage is also seen on grafana: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1718889892165&to=1718893680225
Updated by okurz 5 months ago
- Related to action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M added
Updated by okurz 5 months ago · Edited
- Status changed from New to Blocked
- Assignee set to okurz
- Priority changed from Urgent to Normal
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1718890439196&to=1718894021805 from the same timeframe shows the same symptoms as from #159396 and https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1718890439196&to=1718894021805&viewPanel=179 shows a very active pg_dump after the period of unresponsiveness. We need a proper solution in #159396 first.
Done for i in failed incomplete parallel_failed; do host=openqa.suse.de result="result='$i'" additional_filters="reason like '%mismatch%'" comment="label:poo162038" ./openqa-advanced-retrigger-jobs; done
Updated by okurz 5 months ago
- Status changed from Blocked to New
- Assignee deleted (
okurz) - Priority changed from Normal to High
#159396 was resolved. The symptom "timestamp mismatch" appeared again today, see https://suse.slack.com/archives/C02CANHLANP/p1720581833982129
Julie Cao Fail to clone test on OSD with Reason: api failure: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync, https://openqa.suse.de/tests/14873863#
Updated by livdywan 5 months ago
- Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Updated by livdywan 5 months ago
It's worth noting the following jobs hit this error without being picked up and restarted despite matching the regex:
- https://openqa.suse.de/tests/14878201
- https://openqa.suse.de/tests/14893311
- https://openqa.suse.de/tests/14881999
- https://openqa.suse.de/tests/14893685 says
api failure
only
None of the jobs seems to have worker or os-autoinst logs but that was also the case with previous occurrences.
Updated by livdywan 5 months ago · Edited
The most interesting event I could spot in sudo journalctl -S '2024-07-11 10:00:00' -U '2024-07-11 11:00:00
is this:
Jul 11 10:33:12 openqa systemd[1]: Started Session c67255 of User dehydrated
Several systemd messages about shutting down and restarting sessions right before those jobs failed.
Updated by livdywan 5 months ago · Edited
Apparently we're seeing more incompletes with different variants of "api error", and the journal contains multiple instances of this:
Jul 11 16:25:27 openqa wickedd[912]: restarted rtnetlink event listener
Jul 11 16:25:27 openqa wickedd[912]: rtnetlink event receive error: Out of memory (No buffer space available)
- https://openqa.suse.de/tests/14895382
api failure
- https://openqa.suse.de/tests/14895392
api failure
- https://openqa.suse.de/tests/14894713
api failure: Connection error: Premature connection close
- https://openqa.suse.de/tests/14895342
api failure: Connection error: Premature connection close
- https://openqa.suse.de/tests/14895539
api failure: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
Updated by nicksinger 5 months ago
Some NTP/chrony-values from all machines as reference:
openqa:~ # chronyc tracking
Reference ID : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
Stratum : 3
Ref time (UTC) : Thu Jul 11 13:57:27 2024
System time : 0.000022083 seconds fast of NTP time
Last offset : +0.000024303 seconds
RMS offset : 0.000066092 seconds
Frequency : 27.581 ppm slow
Residual freq : +0.001 ppm
Skew : 0.039 ppm
Root delay : 0.001308891 seconds
Root dispersion : 0.004389539 seconds
Update interval : 1026.1 seconds
Leap status : Normal
openqa:~ # salt --no-color \* cmd.run 'chronyc tracking'
openqa.suse.de:
Reference ID : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
Stratum : 3
Ref time (UTC) : Thu Jul 11 13:57:27 2024
System time : 0.000021778 seconds fast of NTP time
Last offset : +0.000024303 seconds
RMS offset : 0.000066092 seconds
Frequency : 27.581 ppm slow
Residual freq : +0.001 ppm
Skew : 0.039 ppm
Root delay : 0.001308891 seconds
Root dispersion : 0.004403387 seconds
Update interval : 1026.1 seconds
Leap status : Normal
qamaster.qe.nue2.suse.org:
Reference ID : C3C98961 (defiant.tlercher.de)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:43:15 2024
System time : 0.000019144 seconds slow of NTP time
Last offset : -0.000010651 seconds
RMS offset : 0.000040153 seconds
Frequency : 25.721 ppm fast
Residual freq : +0.010 ppm
Skew : 0.066 ppm
Root delay : 0.004941278 seconds
Root dispersion : 0.000637016 seconds
Update interval : 64.2 seconds
Leap status : Normal
qesapworker-prg4.qa.suse.cz:
Reference ID : 4E6C60C5 (vip197.czela.net)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:35:23 2024
System time : 0.000043590 seconds slow of NTP time
Last offset : -0.000018665 seconds
RMS offset : 0.000030183 seconds
Frequency : 12.469 ppm slow
Residual freq : +0.000 ppm
Skew : 0.011 ppm
Root delay : 0.002343056 seconds
Root dispersion : 0.001813403 seconds
Update interval : 513.1 seconds
Leap status : Normal
backup-vm.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:24 2024
System time : 0.000067627 seconds slow of NTP time
Last offset : +0.000029119 seconds
RMS offset : 0.000088596 seconds
Frequency : 25.691 ppm fast
Residual freq : +0.000 ppm
Skew : 0.024 ppm
Root delay : 0.005426399 seconds
Root dispersion : 0.002315616 seconds
Update interval : 1035.6 seconds
Leap status : Normal
openqaworker1.qe.nue2.suse.org:
Reference ID : D5EFEA1C (server1b.meinberg.de)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:01 2024
System time : 0.000018288 seconds slow of NTP time
Last offset : -0.000010778 seconds
RMS offset : 0.000012211 seconds
Frequency : 55.573 ppm slow
Residual freq : -0.007 ppm
Skew : 0.062 ppm
Root delay : 0.004088305 seconds
Root dispersion : 0.000172256 seconds
Update interval : 130.0 seconds
Leap status : Normal
tumblesle.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:30:25 2024
System time : 0.000177187 seconds slow of NTP time
Last offset : -0.000153939 seconds
RMS offset : 0.000094980 seconds
Frequency : 25.823 ppm fast
Residual freq : -0.003 ppm
Skew : 0.024 ppm
Root delay : 0.005508567 seconds
Root dispersion : 0.002260174 seconds
Update interval : 1039.8 seconds
Leap status : Normal
storage.qe.prg2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:16 2024
System time : 0.000027968 seconds slow of NTP time
Last offset : -0.000006329 seconds
RMS offset : 0.000016318 seconds
Frequency : 38.513 ppm slow
Residual freq : -0.000 ppm
Skew : 0.009 ppm
Root delay : 0.001057816 seconds
Root dispersion : 0.001494096 seconds
Update interval : 516.7 seconds
Leap status : Normal
qesapworker-prg5.qa.suse.cz:
Reference ID : 4E6C60C5 (vip197.czela.net)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:38:03 2024
System time : 0.000044509 seconds fast of NTP time
Last offset : +0.000044949 seconds
RMS offset : 0.000075088 seconds
Frequency : 15.335 ppm slow
Residual freq : +0.001 ppm
Skew : 0.010 ppm
Root delay : 0.002453622 seconds
Root dispersion : 0.001443175 seconds
Update interval : 1036.7 seconds
Leap status : Normal
imagetester.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:29:53 2024
System time : 0.000046080 seconds fast of NTP time
Last offset : -0.000012911 seconds
RMS offset : 0.000062851 seconds
Frequency : 66.388 ppm fast
Residual freq : -0.000 ppm
Skew : 0.017 ppm
Root delay : 0.005378101 seconds
Root dispersion : 0.002174454 seconds
Update interval : 1037.0 seconds
Leap status : Normal
jenkins.qe.nue2.suse.org:
Reference ID : 83BC03DD (ntp1.rrze.uni-erlangen.de)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:29:32 2024
System time : 0.000091726 seconds slow of NTP time
Last offset : +0.000001189 seconds
RMS offset : 0.000068897 seconds
Frequency : 25.711 ppm fast
Residual freq : +0.001 ppm
Skew : 0.066 ppm
Root delay : 0.006706720 seconds
Root dispersion : 0.002170046 seconds
Update interval : 514.7 seconds
Leap status : Normal
openqaw5-xen.qe.prg2.suse.org:
Reference ID : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:41:14 2024
System time : 0.000026322 seconds fast of NTP time
Last offset : +0.000008645 seconds
RMS offset : 0.000013326 seconds
Frequency : 19.957 ppm slow
Residual freq : +0.000 ppm
Skew : 0.009 ppm
Root delay : 0.001055759 seconds
Root dispersion : 0.001510400 seconds
Update interval : 258.2 seconds
Leap status : Normal
baremetal-support.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:39:59 2024
System time : 0.000289661 seconds slow of NTP time
Last offset : -0.000209298 seconds
RMS offset : 0.000106703 seconds
Frequency : 25.723 ppm fast
Residual freq : -0.017 ppm
Skew : 0.075 ppm
Root delay : 0.005447044 seconds
Root dispersion : 0.002248233 seconds
Update interval : 1032.8 seconds
Leap status : Normal
diesel.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:37:41 2024
System time : 0.000083468 seconds slow of NTP time
Last offset : +0.000029892 seconds
RMS offset : 0.000053862 seconds
Frequency : 12.831 ppm slow
Residual freq : +0.000 ppm
Skew : 0.014 ppm
Root delay : 0.005361831 seconds
Root dispersion : 0.001741083 seconds
Update interval : 514.3 seconds
Leap status : Normal
openqaworker16.qa.suse.cz:
Reference ID : 4E50AC85 (78.80.172.133)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:36:10 2024
System time : 0.000095464 seconds fast of NTP time
Last offset : +0.000010597 seconds
RMS offset : 0.000050793 seconds
Frequency : 30.884 ppm slow
Residual freq : +0.003 ppm
Skew : 0.042 ppm
Root delay : 0.003768685 seconds
Root dispersion : 0.000559975 seconds
Update interval : 258.1 seconds
Leap status : Normal
worker35.oqa.prg2.suse.org:
Reference ID : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:45 2024
System time : 0.000008578 seconds slow of NTP time
Last offset : -0.000014809 seconds
RMS offset : 0.000054246 seconds
Frequency : 12.161 ppm slow
Residual freq : -0.022 ppm
Skew : 0.152 ppm
Root delay : 0.001055162 seconds
Root dispersion : 0.001556702 seconds
Update interval : 64.3 seconds
Leap status : Normal
sapworker1.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:41 2024
System time : 0.000138152 seconds fast of NTP time
Last offset : +0.000006250 seconds
RMS offset : 0.000065712 seconds
Frequency : 2.265 ppm fast
Residual freq : +0.001 ppm
Skew : 0.066 ppm
Root delay : 0.005746688 seconds
Root dispersion : 0.001819121 seconds
Update interval : 258.2 seconds
Leap status : Normal
monitor.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:33:43 2024
System time : 0.000173209 seconds slow of NTP time
Last offset : -0.000073582 seconds
RMS offset : 0.000067978 seconds
Frequency : 25.700 ppm fast
Residual freq : -0.001 ppm
Skew : 0.031 ppm
Root delay : 0.005451505 seconds
Root dispersion : 0.002243097 seconds
Update interval : 1041.1 seconds
Leap status : Normal
worker32.oqa.prg2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:59 2024
System time : 0.000022194 seconds fast of NTP time
Last offset : +0.000006747 seconds
RMS offset : 0.000010521 seconds
Frequency : 20.574 ppm slow
Residual freq : +0.000 ppm
Skew : 0.017 ppm
Root delay : 0.001044271 seconds
Root dispersion : 0.002068724 seconds
Update interval : 64.7 seconds
Leap status : Normal
openqaworker18.qa.suse.cz:
Reference ID : D5C036E8 (213.192.54.232)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:41:46 2024
System time : 0.000052862 seconds slow of NTP time
Last offset : -0.000002611 seconds
RMS offset : 0.000075304 seconds
Frequency : 36.898 ppm slow
Residual freq : -0.001 ppm
Skew : 0.042 ppm
Root delay : 0.002779522 seconds
Root dispersion : 0.000671763 seconds
Update interval : 260.3 seconds
Leap status : Normal
sapworker3.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:27:09 2024
System time : 0.000080542 seconds fast of NTP time
Last offset : +0.000005551 seconds
RMS offset : 0.000063281 seconds
Frequency : 3.990 ppm fast
Residual freq : +0.001 ppm
Skew : 0.024 ppm
Root delay : 0.005348192 seconds
Root dispersion : 0.002241552 seconds
Update interval : 1031.0 seconds
Leap status : Normal
mania.qe.nue2.suse.org:
Reference ID : C03C5811 (ntp2.m-online.net)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:43:14 2024
System time : 0.000005734 seconds slow of NTP time
Last offset : -0.000007869 seconds
RMS offset : 0.000026844 seconds
Frequency : 23.346 ppm slow
Residual freq : -0.001 ppm
Skew : 0.027 ppm
Root delay : 0.005080119 seconds
Root dispersion : 0.000471100 seconds
Update interval : 259.6 seconds
Leap status : Normal
openqaworker17.qa.suse.cz:
Reference ID : 0501387B (ntp.suas.cz)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:36:29 2024
System time : 0.000211145 seconds slow of NTP time
Last offset : -0.000442668 seconds
RMS offset : 0.000230532 seconds
Frequency : 38.706 ppm slow
Residual freq : -0.032 ppm
Skew : 0.277 ppm
Root delay : 0.006446593 seconds
Root dispersion : 0.000680730 seconds
Update interval : 1041.3 seconds
Leap status : Normal
schort-server.qe.nue2.suse.org:
Reference ID : C3C98961 (defiant.tlercher.de)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:43:19 2024
System time : 0.000065558 seconds fast of NTP time
Last offset : +0.000026393 seconds
RMS offset : 0.000062144 seconds
Frequency : 25.702 ppm fast
Residual freq : -0.000 ppm
Skew : 0.020 ppm
Root delay : 0.005218701 seconds
Root dispersion : 0.001101224 seconds
Update interval : 130.0 seconds
Leap status : Normal
sapworker2.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:01 2024
System time : 0.000100719 seconds fast of NTP time
Last offset : +0.000037721 seconds
RMS offset : 0.000043879 seconds
Frequency : 0.090 ppm fast
Residual freq : +0.001 ppm
Skew : 0.027 ppm
Root delay : 0.005470708 seconds
Root dispersion : 0.001831063 seconds
Update interval : 512.1 seconds
Leap status : Normal
petrol.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:26:40 2024
System time : 0.000170613 seconds slow of NTP time
Last offset : -0.000278480 seconds
RMS offset : 0.000129630 seconds
Frequency : 18.311 ppm slow
Residual freq : -0.010 ppm
Skew : 0.070 ppm
Root delay : 0.005347948 seconds
Root dispersion : 0.003424126 seconds
Update interval : 1043.3 seconds
Leap status : Normal
ada.qe.prg2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:36:27 2024
System time : 0.000016693 seconds fast of NTP time
Last offset : -0.000013316 seconds
RMS offset : 0.000014862 seconds
Frequency : 16.582 ppm fast
Residual freq : -0.001 ppm
Skew : 0.011 ppm
Root delay : 0.001102176 seconds
Root dispersion : 0.001735037 seconds
Update interval : 513.4 seconds
Leap status : Normal
worker-arm1.oqa.prg2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:26:24 2024
System time : 0.000007647 seconds fast of NTP time
Last offset : +0.000080160 seconds
RMS offset : 0.000034377 seconds
Frequency : 5.537 ppm slow
Residual freq : +0.005 ppm
Skew : 0.024 ppm
Root delay : 0.001136960 seconds
Root dispersion : 0.002377458 seconds
Update interval : 517.0 seconds
Leap status : Normal
openqaworker14.qa.suse.cz:
Reference ID : 4E6C60C5 (vip197.czela.net)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:36:43 2024
System time : 0.000310024 seconds slow of NTP time
Last offset : -0.000359515 seconds
RMS offset : 0.000177298 seconds
Frequency : 7.105 ppm slow
Residual freq : -0.016 ppm
Skew : 0.115 ppm
Root delay : 0.002246775 seconds
Root dispersion : 0.002097186 seconds
Update interval : 2077.0 seconds
Leap status : Normal
osiris-1.qe.nue2.suse.org:
Reference ID : 4F852C8E (79.133.44.142)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:40:34 2024
System time : 0.000006179 seconds slow of NTP time
Last offset : -0.000007087 seconds
RMS offset : 0.000014982 seconds
Frequency : 11.098 ppm fast
Residual freq : -0.000 ppm
Skew : 0.018 ppm
Root delay : 0.003305381 seconds
Root dispersion : 0.000208739 seconds
Update interval : 259.5 seconds
Leap status : Normal
openqaworker-arm-1.qe.nue2.suse.org:
Reference ID : BF2D4343 (server1b.meinberg.de)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:41:28 2024
System time : 0.000093359 seconds slow of NTP time
Last offset : -0.000007283 seconds
RMS offset : 0.000033947 seconds
Frequency : 24.445 ppm slow
Residual freq : -0.001 ppm
Skew : 0.024 ppm
Root delay : 0.004061557 seconds
Root dispersion : 0.000189946 seconds
Update interval : 515.2 seconds
Leap status : Normal
backup-qam.qe.nue2.suse.org:
Reference ID : C3C98961 (defiant.tlercher.de)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:43:24 2024
System time : 0.000006009 seconds fast of NTP time
Last offset : +0.000001472 seconds
RMS offset : 0.000033058 seconds
Frequency : 19.443 ppm fast
Residual freq : -0.000 ppm
Skew : 0.055 ppm
Root delay : 0.005161884 seconds
Root dispersion : 0.000541619 seconds
Update interval : 65.0 seconds
Leap status : Normal
openqa-piworker.qe.nue2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:40:42 2024
System time : 0.000003187 seconds fast of NTP time
Last offset : -0.000018070 seconds
RMS offset : 0.000012431 seconds
Frequency : 9.311 ppm fast
Residual freq : -0.010 ppm
Skew : 0.059 ppm
Root delay : 0.005354426 seconds
Root dispersion : 0.001506209 seconds
Update interval : 258.0 seconds
Leap status : Normal
unreal6.qe.nue2.suse.org:
Reference ID : D5AC696A (ns1.blazing.de)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:30:00 2024
System time : 0.000075821 seconds fast of NTP time
Last offset : +0.000169504 seconds
RMS offset : 0.000227368 seconds
Frequency : 27.267 ppm fast
Residual freq : +0.055 ppm
Skew : 0.156 ppm
Root delay : 0.003834251 seconds
Root dispersion : 0.002034988 seconds
Update interval : 1029.7 seconds
Leap status : Normal
qesapworker-prg6.qa.suse.cz:
Reference ID : 4E6C60C5 (vip197.czela.net)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:34:48 2024
System time : 0.000066652 seconds fast of NTP time
Last offset : +0.000069314 seconds
RMS offset : 0.000084554 seconds
Frequency : 7.334 ppm slow
Residual freq : +0.004 ppm
Skew : 0.018 ppm
Root delay : 0.002232490 seconds
Root dispersion : 0.001841876 seconds
Update interval : 2080.7 seconds
Leap status : Normal
worker29.oqa.prg2.suse.org:
Reference ID : AEA9312C (2a07:de40:b205:26:10:144:55:130)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:43:23 2024
System time : 0.000015936 seconds fast of NTP time
Last offset : +0.000011206 seconds
RMS offset : 0.000020762 seconds
Frequency : 11.054 ppm slow
Residual freq : +0.002 ppm
Skew : 0.061 ppm
Root delay : 0.003176525 seconds
Root dispersion : 0.000410482 seconds
Update interval : 65.1 seconds
Leap status : Normal
qesapworker-prg7.qa.suse.cz:
Reference ID : 4E6C60C5 (vip197.czela.net)
Stratum : 2
Ref time (UTC) : Thu Jul 11 14:36:00 2024
System time : 0.000053961 seconds fast of NTP time
Last offset : +0.000056638 seconds
RMS offset : 0.000071415 seconds
Frequency : 5.603 ppm slow
Residual freq : +0.000 ppm
Skew : 0.009 ppm
Root delay : 0.002314058 seconds
Root dispersion : 0.001585285 seconds
Update interval : 2074.7 seconds
Leap status : Normal
grenache-1.oqa.prg2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:30:33 2024
System time : 0.000051045 seconds fast of NTP time
Last offset : +0.000001856 seconds
RMS offset : 0.000103933 seconds
Frequency : 7.286 ppm slow
Residual freq : -0.000 ppm
Skew : 0.029 ppm
Root delay : 0.001180847 seconds
Root dispersion : 0.002281780 seconds
Update interval : 1034.7 seconds
Leap status : Normal
worker-arm2.oqa.prg2.suse.org:
Reference ID : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:39:56 2024
System time : 0.000105552 seconds slow of NTP time
Last offset : -0.000024404 seconds
RMS offset : 0.000042628 seconds
Frequency : 7.849 ppm slow
Residual freq : -0.001 ppm
Skew : 0.013 ppm
Root delay : 0.001032133 seconds
Root dispersion : 0.001510956 seconds
Update interval : 256.7 seconds
Leap status : Normal
s390zl12.oqa.prg2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 13:53:31 2024
System time : 0.000012752 seconds slow of NTP time
Last offset : +0.000000963 seconds
RMS offset : 0.000022999 seconds
Frequency : 0.050 ppm fast
Residual freq : -0.000 ppm
Skew : 0.008 ppm
Root delay : 0.001468576 seconds
Root dispersion : 0.004334040 seconds
Update interval : 517.5 seconds
Leap status : Normal
worker33.oqa.prg2.suse.org:
Reference ID : 0A903781 (kronos01.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:38:59 2024
System time : 0.000001046 seconds fast of NTP time
Last offset : -0.000000740 seconds
RMS offset : 0.000018752 seconds
Frequency : 15.258 ppm slow
Residual freq : -0.000 ppm
Skew : 0.017 ppm
Root delay : 0.001051913 seconds
Root dispersion : 0.001512931 seconds
Update interval : 516.7 seconds
Leap status : Normal
worker34.oqa.prg2.suse.org:
Reference ID : 0A903782 (kronos02.prg2.suse.org)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:43:08 2024
System time : 0.000043618 seconds fast of NTP time
Last offset : +0.000027052 seconds
RMS offset : 0.000025032 seconds
Frequency : 7.000 ppm slow
Residual freq : +0.030 ppm
Skew : 0.169 ppm
Root delay : 0.003190971 seconds
Root dispersion : 0.000613747 seconds
Update interval : 130.4 seconds
Leap status : Normal
worker30.oqa.prg2.suse.org:
Reference ID : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:40:31 2024
System time : 0.000018638 seconds slow of NTP time
Last offset : -0.000069619 seconds
RMS offset : 0.000037803 seconds
Frequency : 14.052 ppm slow
Residual freq : -0.063 ppm
Skew : 0.194 ppm
Root delay : 0.001035445 seconds
Root dispersion : 0.001595671 seconds
Update interval : 130.4 seconds
Leap status : Normal
worker40.oqa.prg2.suse.org:
Reference ID : AEA9312C (2a07:de40:b205:26:10:144:55:130)
Stratum : 3
Ref time (UTC) : Thu Jul 11 14:42:35 2024
System time : 0.000030772 seconds fast of NTP time
Last offset : +0.000001950 seconds
RMS offset : 0.000130650 seconds
Frequency : 7.392 ppm slow
Residual freq : -0.001 ppm
Skew : 0.087 ppm
Root delay : 0.003208488 seconds
Root dispersion : 0.000277519 seconds
Update interval : 65.1 seconds
Leap status : Normal
s390zl13.oqa.prg2.suse.org:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20240711144328940954
ERROR: Minions returned with non-zero exit code
The "System time" and the therefore reported NTP offset seems to be way too low to cause any problems. We usually saw api authentication issues with time skew in the range of multiple seconds.
Updated by livdywan 5 months ago
livdywan wrote in #note-17:
dry_run=1 host=openqa.suse.de result="result='incomplete'" additional_filters="reason like '%api failure%'" comment="label:poo162038" ./openqa-advanced-retrigger-jobs
Retriggering jobs with api failure
to mitigate while the cause is still unclear.
Updated by livdywan 5 months ago
- Has duplicate action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S added
Updated by livdywan 5 months ago
A few kernel jobs have failed during upload phase with quite non-descript reason: "api failure". As a result, there's no autoinst-log.txt nor worker-log.txt.
https://openqa.suse.de/tests/14897579
https://openqa.suse.de/tests/14897580
https://openqa.suse.de/tests/14895759
Updated by MDoucha 5 months ago
More examples of unspecified "api failure":
https://openqa.suse.de/tests/14897576
https://openqa.suse.de/tests/14882310
https://openqa.suse.de/tests/14895787
https://openqa.suse.de/tests/14895760 - with logs
https://openqa.suse.de/tests/14896597 - also with logs
Updated by livdywan 5 months ago
- Status changed from In Progress to New
- Assignee deleted (
livdywan)
MDoucha wrote in #note-24:
More examples of unspecified "api failure":
Thank you for these examples. The last two especially seem to confirm it's related to uploading of assets even when the job is otherwise running fine:
https://openqa.suse.de/tests/14895760 - with logs
[2024-07-11T17:03:20.278926+02:00] [debug] [pid:8428] backend process exited: 0
Use of uninitialized value $name in pattern match (m//) at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 332.
OpenQA::Isotovideo::Utils::handle_generated_assets(OpenQA::Isotovideo::CommandHandler=HASH(0x55c503301a38), 1) called at /usr/bin/isotovideo line 152
main::handle_shutdown() called at /usr/bin/isotovideo line 187
eval {...} called at /usr/bin/isotovideo line 176
[2024-07-11T17:03:20.383373+02:00] [debug] [pid:8428] stopping backend process 8637
[...]
[2024-07-11T17:07:04.861866+02:00] [info] [pid:27937] Uploading textinfo-logs.tar.bz2
[2024-07-11T17:15:21.170344+02:00] [error] [pid:27937] All 60 upload attempts have failed for textinfo-logs.tar.bz2
[2024-07-11T17:15:21.170616+02:00] [error] [pid:27937] Error uploading textinfo-logs.tar.bz2: 403 response: Forbidden
[2024-07-11T17:15:12.171535+02:00] [debug] [pid:64848] Upload concluded (no current module)
[2024-07-11T17:15:21.169877+02:00] [warn] [pid:27937] Error uploading textinfo-logs.tar.bz2: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
[2024-07-11T17:15:21.170437+02:00] [error] [pid:27937] All 60 upload attempts have failed for textinfo-logs.tar.bz2
[2024-07-11T17:15:21.170670+02:00] [error] [pid:27937] Error uploading textinfo-logs.tar.bz2: 403 response: Forbidden
[2024-07-11T17:15:21.191433+02:00] [info] [pid:27937] Uploading video.webm
https://openqa.suse.de/tests/14896597 - also with logs
[2024-07-11T17:19:06.909676+02:00] [info] [pid:30134] Result: done
[2024-07-11T17:19:06.914566+02:00] [info] [pid:44561] Uploading patch_and_reboot-kernel_changelog.log
[2024-07-11T17:28:00.193604+02:00] [error] [pid:44561] All 60 upload attempts have failed for patch_and_reboot-kernel_changelog.log
[2024-07-11T17:28:00.193858+02:00] [error] [pid:44561] Error uploading patch_and_reboot-kernel_changelog.log: 403 response: Forbidden
[2024-07-11T17:28:00.201425+02:00] [info] [pid:44561] Uploading video.webm
[2024-07-11T17:27:52.117556+02:00] [debug] [pid:30134] REST-API call: POST http://openqa.suse.de/api/v1/jobs/14896597/status
[2024-07-11T17:27:52.653861+02:00] [debug] [pid:30134] Upload concluded (no current module)
[2024-07-11T17:28:00.193340+02:00] [warn] [pid:44561] Error uploading patch_and_reboot-kernel_changelog.log: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
[2024-07-11T17:28:00.193681+02:00] [error] [pid:44561] All 60 upload attempts have failed for patch_and_reboot-kernel_changelog.log
[2024-07-11T17:28:00.193921+02:00] [error] [pid:44561] Error uploading patch_and_reboot-kernel_changelog.log: 403 response: Forbidden
[2024-07-11T17:28:00.201382+02:00] [info] [pid:44561] Uploading video.webm
I suggest to estimate this ticket and decide how to best handle these.
Updated by okurz 5 months ago
- Subject changed from No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry to No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz 5 months ago
- Priority changed from High to Urgent
Last seen today: https://suse.slack.com/archives/C02CANHLANP/p1721002348319279?thread_ts=1721002348.319279&cid=C02CANHLANP
there are several failures during the weekend, for example: https://openqa.suse.de/tests/14930578
Updated by ph03nix 4 months ago · Edited
I've observed also the following failures today:
- https://openqa.suse.de/tests/14927559
- https://openqa.suse.de/tests/14927504
- https://openqa.suse.de/tests/14927512
- https://openqa.suse.de/tests/14927523
- https://openqa.suse.de/tests/14927554
- https://openqa.suse.de/tests/14928085
- https://openqa.suse.de/tests/14928069
- https://openqa.suse.de/tests/14928061
- https://openqa.suse.de/tests/14928077
- https://openqa.suse.de/tests/14928067
- https://openqa.suse.de/tests/14927978
Here the following workers are affected:
- worker35
- sapworker1
- worker30
- worker29
Updated by mkittler 4 months ago
Also see #163592#note-34 for an update and how to query for relevant jobs. I have also restarted relevant jobs as part of this ticket.
Updated by nicksinger 4 months ago
- Status changed from Workable to In Progress
- Assignee set to nicksinger
Updated by openqa_review 4 months ago
- Due date set to 2024-07-30
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 4 months ago
- Priority changed from Urgent to High
openqa=> select count(id), min(t_finished) as first_t_finished, max(t_finished) as last_t_finished, reason from jobs where t_finished >= '2024-07-16T08:00:00' and result = 'incomplete' group by reason order by count(id) desc;
count | first_t_finished | last_t_finished | reason
-------+---------------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
25 | 2024-07-16 08:14:51 | 2024-07-17 03:00:12 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
23 | 2024-07-16 08:48:03 | 2024-07-16 15:38:38 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5907>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
23 | 2024-07-16 08:42:46 | 2024-07-16 15:55:46 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
20 | 2024-07-16 10:56:23 | 2024-07-17 04:39:37 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
18 | 2024-07-16 18:35:12 | 2024-07-16 18:56:12 | asset failure: Failed to download SLE-15-SP4-Full-ppc64le-Build190.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-ppc64le-Build190.2-Media1.iso.sha256
16 | 2024-07-16 17:49:27 | 2024-07-17 02:01:31 | tests died: unable to load main.pm, check the log for the cause (e.g. syntax error)
13 | 2024-07-16 17:02:18 | 2024-07-17 02:19:48 | backend died: QEMU exited unexpectedly, see log for details
10 | 2024-07-16 11:00:58 | 2024-07-17 07:17:11 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5903>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
10 | 2024-07-16 17:12:07 | 2024-07-16 17:48:14 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build207.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build207.1-Media1.iso
10 | 2024-07-16 17:19:03 | 2024-07-16 18:13:39 | asset failure: Failed to download SLE-15-SP4-Full-x86_64-Build208.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-x86_64-Build208.1-Media1.iso
9 | 2024-07-16 08:06:23 | 2024-07-17 05:02:10 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5902>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
8 | 2024-07-16 17:58:39 | 2024-07-16 17:59:35 | asset failure: Failed to download SLE-15-SP4-Full-aarch64-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-aarch64-Build208.1-Media1.iso.sha256
8 | 2024-07-16 17:49:19 | 2024-07-16 17:52:40 | asset failure: Failed to download SLE-15-SP4-Online-aarch64-Build207.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-aarch64-Build207.1-Media1.iso
8 | 2024-07-16 09:14:22 | 2024-07-16 09:46:59 | quit: worker has been stopped or restarted
8 | 2024-07-16 17:49:00 | 2024-07-16 18:00:12 | asset failure: Failed to download SLE-15-SP4-Full-s390x-Build208.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-s390x-Build208.1-Media1.iso
7 | 2024-07-16 08:08:31 | 2024-07-16 10:44:56 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5904>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
6 | 2024-07-17 02:30:00 | 2024-07-17 02:46:23 | tests died: unable to load tests/console/validate_raid_encrypt_home.pm, check the log for the cause (e.g. syntax error)
5 | 2024-07-16 12:44:27 | 2024-07-17 08:29:55 | backend died: QEMU terminated before QMP connection could be established. Check for errors below
4 | 2024-07-17 06:31:22 | 2024-07-17 07:19:18 | asset failure: Failed to download SLES-15-SP5-x86_64-create_hdd_yast_maintenance_desktop_deregistered.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-x86_64-create_hdd_yast_maintenance_desktop_deregistered.qcow2
4 | 2024-07-16 17:11:22 | 2024-07-16 18:26:03 | asset failure: Failed to download SLE-15-SP4-Online-ppc64le-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-ppc64le-Build208.1-Media1.iso.sha256
4 | 2024-07-16 17:39:36 | 2024-07-16 17:58:40 | asset failure: Failed to download SLE-15-SP4-Online-aarch64-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-aarch64-Build191.4-Media1.iso.sha256
4 | 2024-07-16 17:02:01 | 2024-07-16 17:02:56 | asset failure: Failed to download sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2
3 | 2024-07-16 17:01:39 | 2024-07-16 17:02:10 | asset failure: Failed to download sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2
3 | 2024-07-16 09:53:10 | 2024-07-16 10:39:59 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5908>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
2 | 2024-07-16 23:22:35 | 2024-07-17 01:23:02 | asset failure: Failed to download SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256
2 | 2024-07-16 17:38:41 | 2024-07-16 17:38:49 | asset failure: Failed to download SLE-15-SP4-Full-x86_64-Build195.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-x86_64-Build195.1-Media1.iso
2 | 2024-07-16 17:48:09 | 2024-07-16 17:48:10 | asset failure: Failed to download SLE-15-SP4-Online-aarch64-Build190.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-aarch64-Build190.2-Media1.iso.sha256
2 | 2024-07-16 15:08:08 | 2024-07-16 15:08:13 | asset failure: Failed to download SLE-12-SP5-x86_64-Build:34795:saptune-qam-sles4sap-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-x86_64-Build:34795:saptune-qam-sles4sap-gnome.qcow2
2 | 2024-07-16 17:45:05 | 2024-07-16 17:45:07 | asset failure: Failed to download SLE-15-SP4-Online-s390x-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-s390x-Build191.4-Media1.iso.sha256
2 | 2024-07-16 17:39:34 | 2024-07-16 17:39:42 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build190.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build190.2-Media1.iso.sha256
2 | 2024-07-16 18:04:17 | 2024-07-16 19:22:34 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build191.4-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build191.4-Media1.iso
2 | 2024-07-16 17:34:55 | 2024-07-16 17:34:57 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build195.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build195.1-Media1.iso.sha256
2 | 2024-07-16 10:13:02 | 2024-07-16 12:10:50 | asset failure: Failed to download SLES-15-SP3-x86_64-mru-install-minimal-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-x86_64-mru-install-minimal-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2
2 | 2024-07-16 22:03:42 | 2024-07-16 22:05:13 | asset failure: Failed to download autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20240716-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20240716-1-Server-DVD-Updates-64bit.qcow2
2 | 2024-07-16 17:43:07 | 2024-07-16 17:43:14 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit-amd.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit-amd.qcow2
2 | 2024-07-16 17:43:44 | 2024-07-16 17:43:52 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit.qcow2
2 | 2024-07-16 17:36:55 | 2024-07-16 17:36:56 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit-amd.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit-amd.qcow2
2 | 2024-07-16 17:39:09 | 2024-07-16 17:39:12 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit.qcow2
2 | 2024-07-16 18:07:48 | 2024-07-16 18:07:51 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-textmode_ay_qesec@uefi.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-textmode_ay_qesec@uefi.qcow2
2 | 2024-07-16 17:34:37 | 2024-07-16 17:34:40 | asset failure: Failed to download sle-15-SP4-x86_64-207.1-textmode_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-207.1-textmode_ay_qesec@64bit.qcow2
2 | 2024-07-16 15:03:44 | 2024-07-16 16:12:30 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
2 | 2024-07-17 06:10:10 | 2024-07-17 07:07:21 | tests died: unable to load tests/autoyast/installation.pm, check the log for the cause (e.g. syntax error)
2 | 2024-07-16 21:31:54 | 2024-07-16 22:09:05 | tests died: unable to load tests/network/samba/samba_adcli.pm, check the log for the cause (e.g. syntax error)
1 | 2024-07-16 14:48:03 | 2024-07-16 14:48:03 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
1 | 2024-07-16 15:57:07 | 2024-07-16 15:57:07 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5907>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
1 | 2024-07-16 12:39:26 | 2024-07-16 12:39:26 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5903>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
1 | 2024-07-16 17:01:53 | 2024-07-16 17:01:53 | asset failure: Failed to download sle-15-SP4-x86_64-208.7-textmode_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-textmode_ay_qesec@64bit.qcow2
1 | 2024-07-16 17:02:22 | 2024-07-16 17:02:22 | asset failure: Failed to download sle-15-SP4-x86_64-208.7-gnome@64bit-fips.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-gnome@64bit-fips.qcow2
1 | 2024-07-17 08:34:41 | 2024-07-17 08:34:41 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 30097. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 01:54:13 | 2024-07-17 01:54:13 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 30478. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 03:03:35 | 2024-07-17 03:03:35 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 30701. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 03:27:17 | 2024-07-17 03:27:17 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 31505. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 08:33:49 | 2024-07-17 08:33:49 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 32030. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 03:00:56 | 2024-07-17 03:00:56 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 32264. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 03:04:23 | 2024-07-17 03:04:23 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 34150. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 03:27:18 | 2024-07-17 03:27:18 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 34451. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 03:29:34 | 2024-07-17 03:29:34 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 34695. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 01:27:40 | 2024-07-17 01:27:40 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 35376. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 01:31:48 | 2024-07-17 01:31:48 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 35532. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 01:58:31 | 2024-07-17 01:58:31 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 36968. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
1 | 2024-07-17 01:06:55 | 2024-07-17 01:06:55 | backend died: Virtio terminal and svirt serial terminal do not support send_key. Use
1 | 2024-07-17 08:53:08 | 2024-07-17 08:53:08 | backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 842.
1 | 2024-07-16 23:24:07 | 2024-07-16 23:24:07 | cache failure: Failed to download SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256 [Auto-restarting because reason matches the configured "auto_clone_regex".]
1 | 2024-07-16 18:09:47 | 2024-07-16 18:09:47 | cache failure: Failed to download SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256 [Auto-restarting because reason matches the configured "auto_clone_regex".]
1 | 2024-07-16 17:01:38 | 2024-07-16 17:01:38 | cache failure: Failed to download sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2 [Auto-restarting because reason matches the configured "auto_clone_regex".]
1 | 2024-07-16 17:02:33 | 2024-07-16 17:02:33 | cache failure: Failed to download sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2 [Auto-restarting because reason matches the configured "auto_clone_regex".]
1 | 2024-07-16 21:27:57 | 2024-07-16 21:27:57 | asset failure: Failed to download autoyast_SLES-12-SP5-x86_64-create_hdd_yast_maintenance_desktop-Build20240716-1.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-12-SP5-x86_64-create_hdd_yast_maintenance_desktop-Build20240716-1.qcow2
1 | 2024-07-17 07:42:07 | 2024-07-17 07:42:07 | asset failure: Failed to download SLES-15-SP5-x86_64-mru-install-desktop-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-x86_64-mru-install-desktop-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2
1 | 2024-07-16 18:10:02 | 2024-07-16 18:10:02 | asset failure: Failed to download SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256
1 | 2024-07-17 01:13:57 | 2024-07-17 01:13:57 | backend died: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
1 | 2024-07-17 01:07:21 | 2024-07-17 01:07:21 | backend died: Error connecting to VNC server <s390kvm096.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
1 | 2024-07-16 12:07:21 | 2024-07-16 12:07:21 | tests died: unable to load tests/console/perl_bootloader.pm, check the log for the cause (e.g. syntax error)
Quite some incomplete but not a single "api failure: 403 response: timestamp mismatch" (2 days ago: https://progress.opensuse.org/issues/163592#note-34) - so the auto-review works and restarts affected jobs. We can lower the prio then.
Updated by nicksinger 4 months ago
livdywan wrote in #note-35:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1228 was merged
Thanks!
I had a talk yesterday with @kraih (https://suse.slack.com/archives/C02AJ1E568M/p1721229859258029) and we're actually not sure what will happen if we add this header a second time. I will now closely monitor OSD to see if we encounter problems
Updated by ybonatakis 4 months ago
I tried to restart nginx today and it fails. The logs took me to the conf file which points here.
Jul 19 07:34:35 openqa nginx[28217]: nginx: [emerg] unknown directive "perl_set" in /etc/nginx/nginx.conf:30
Jul 19 07:34:35 openqa nginx[28217]: nginx: configuration file /etc/nginx/nginx.conf test failed
Jul 19 07:34:35 openqa systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Updated by ybonatakis 4 months ago
Nick I manual revert the changes from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1228/diffs in order to restart the nginx and bring back the server. please take a look
Updated by ph03nix 4 months ago
The issue is back:
- https://openqa.suse.de/tests/14976907
- https://openqa.suse.de/tests/14976974
- https://openqa.suse.de/tests/14976918
- https://openqa.suse.de/tests/14976982
- https://openqa.suse.de/tests/14976930
- https://openqa.suse.de/tests/14976989
- https://openqa.suse.de/tests/14976953
- https://openqa.suse.de/tests/14976965
- https://openqa.suse.de/tests/14930578
Updated by nicksinger 4 months ago
ybonatakis wrote in #note-38:
Nick I manual revert the changes from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1228/diffs in order to restart the nginx and bring back the server. please take a look
Not sure why this was not working. nginx -V |& grep perl
shows we have "--with-perl=/usr/bin/perl" but anyhow, might not help because it moves the timestamp generation into an earlier step so not pursuing this route further.
ph03nix wrote in #note-39:
The issue is back:
- https://openqa.suse.de/tests/14976907
- https://openqa.suse.de/tests/14976974
- https://openqa.suse.de/tests/14976918
- https://openqa.suse.de/tests/14976982
- https://openqa.suse.de/tests/14976930
- https://openqa.suse.de/tests/14976989
- https://openqa.suse.de/tests/14976953
- https://openqa.suse.de/tests/14976965
- https://openqa.suse.de/tests/14930578
Thanks, but I have to stress: the issue was never gone, just less worse. Also all the linked jobs you pasted got covered by auto-restart so no manual action needed currently
Updated by nicksinger 4 months ago
- Status changed from In Progress to Feedback
Created https://github.com/os-autoinst/openQA/pull/5782 which changes behavior and also adds some debug log message with both times in case a mismatch happens. I currently can't do much except moving slowly forward and observing the situation with my changes applied.
Updated by okurz 4 months ago
- Copied to action #164296: openqa-label-known-issues does not look at known issues if autoinst-log.txt does not exist but reason could be looked at size:S added
Updated by nicksinger 4 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1233 to workaround currently not restarted tests due to https://progress.opensuse.org/issues/164296 - upstream changes to openQA already hot patched on OSD. Waiting for some example outputs now to further pinpoint the issue.
Updated by nicksinger 4 months ago
- Status changed from Feedback to In Progress
@okurz found https://openqa.suse.de/tests/14992170 - meaning our auto_restart_regex works as expected. Checking the logs of OSD I can see:
openqa:~ # journalctl --since=yesterday | grep "Timestamp mismatch over 300s"
Jul 23 10:30:22 openqa openqa[8091]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723065
Jul 23 10:30:22 openqa openqa[8091]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723065
[…]
Jul 23 10:30:23 openqa openqa[14892]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723120
Jul 23 10:30:23 openqa openqa[14892]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723120
Jul 23 10:30:23 openqa openqa[21151]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723119
Jul 23 10:30:23 openqa openqa[21151]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723119
the time perfectly correlates with an experiment to run fstrim @okurz was conducting ( https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=1721722084983&to=1721724906061 ).
I did some crude math on these numbers above with (cat timestamp_mismatch | xargs -I{} -- bash -c "echo '{}' | awk '{ print \$12 \$16 }'" | tr "," "-") | bc -l
and found that we have mismatches between 303 and 357 seconds, 353 on average. Going forward I will bump this timeout higher for now as a workaround until we understood and fix the underlying performance issue.
Updated by nicksinger 4 months ago
- Status changed from In Progress to Feedback
PR to configure the delta between two timestamps: https://github.com/os-autoinst/openQA/pull/5787
Updated by nicksinger 4 months ago
- Status changed from Feedback to Resolved
I validated that the openQA changes are deployed and applied my config change manually (including restarting services) for now until our pipelines work again. Until now we don't see the new error message which is expected and good. We discussed that this should be sufficient for now and other alerts (e.g. number of new incomplete jobs) should alert us if the situation gets worse.
Updated by okurz 4 months ago
- Copied to action #164418: Distinguish "timestamp mismatch" from cases where webUI is slow or where clocks are really differing added
Updated by nicksinger 4 months ago
- Has duplicate deleted (action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S)
Updated by nicksinger 4 months ago
- Related to action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S added