Project

General

Profile

Actions

action #162038

closed

No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S

Added by okurz 6 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-06-10
Due date:
% Done:

0%

Estimated time:

Description

Observation

There were many logs for check whether clocks on the local host and the web UI host are in sync.
Openqa seems unavailable for ~10min

Jun 10 00:43:01 openqa systemd[1]: Started Session c46901 of User geekotest.                                                                                                                                                                 
Jun 10 00:43:01 openqa cron[13285]: pam_unix(crond:session): session opened for user geekotest by (uid=0)                                                                                                                                    
Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3522 from worker_status (free)                                                                                                           
Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3464 from worker_status (free)                                                                                                           
Jun 10 00:43:02 openqa openqa[32518]: [debug] Rejecting authentication for                                                                                                                                                                   
user "openqaworker4" with ip "10.100.96.76", valid key "...",                                                                                                                                                                   
secret "...", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync                                                                                                                                                                                                   
Jun 10 00:43:02 openqa openqa[4972]: [debug] Rejecting authentication for user "openqaworker4" with ip "**", valid key "**", secret "**", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
Jun 10 00:43:02 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 1952 from worker_status (working)                                                                                                        
Jun 10 00:43:02 openqa openqa[745]: [debug] Rejecting authentication for user
...
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] Enqueued restarting openQA job 14563096 via Minion job 11664862                                                                                                                      
Jun 10 00:43:04 openqa openqa[1637]: [debug] [pid:1637] Enqueued restarting openQA job 14561971 via Minion job 11664863                                                                                                                      
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] _carry_over_candidate(14563096): _failure_reason=cleanup_before_shutdown:none,force_scheduled_tasks:none,grub_set_bootargs:none,hostname:none,shutdown:none                          
Jun 10 00:43:04 openqa openqa[745]: [debug] [pid:745] Enqueued restarting openQA job 14562006 via Minion job 11664864                                                                                                                        
Jun 10 00:43:04 openqa openqa[32535]: [debug] [pid:32535] _carry_over_candidate(14560798): _failure_reason=consoletest_finish:none,mariadb_srv:none,mtab:none,openvswitch:none,orphaned_packages_check:none,rsync:none,ssh_cleanup:none,sshd$
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557]                                                                                                                                                                                      
_carry_over_candidate(14563096): checking take over from 14558523:                                                                                                                                                                           
_failure_reason=GOOD

chrony status looks ok on OSD.

I saw also a python3 update and then issues with telegraf but I dont think they are related with OSD.

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#162038

Suggestions

Out of scope

  • Add details to bare "api failure" #163781
  • Teaching openqa-label-known-issues to look into reason when autoinst-log.txt #164296

Related issues 6 (1 open5 closed)

Related to openQA Tests - action #162644: Job fails with api failure: 403 response: timestamp mismatch after brief osd outageRejectedokurz2024-06-20

Actions
Related to openQA Infrastructure - action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:MResolveddheidler

Actions
Related to openQA Infrastructure - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:MResolvedokurz2024-07-10

Actions
Related to openQA Project - action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:SResolvedmkittler2024-07-11

Actions
Copied to openQA Project - action #164296: openqa-label-known-issues does not look at known issues if autoinst-log.txt does not exist but reason could be looked at size:SResolvedybonatakis

Actions
Copied to openQA Project - action #164418: Distinguish "timestamp mismatch" from cases where webUI is slow or where clocks are really differingNew2024-06-10

Actions
Actions #2

Updated by nicksinger 6 months ago

That doesn't seem to be the first time this happens. I just dumped the whole journal (journalctl | grep "Rejecting authentication") and looked trough the messages with cat /tmp/failed_auth_all | cut -d " " -f-3 | cut -d ":" -f-2 | sort | uniq -c:

    249 Jun 03 00:49
    360 Jun 03 00:55
      1 Jun 03 01:02
    351 Jun 10 00:43
    408 Jun 10 00:49
     70 Jun 10 10:27

The in the description mentioned worker, openqaworker4 is misleading because the account is actually called that way. The IP shows that this is in reality openqaworker18 and its clock looks good to me in that time frame: https://monitor.qa.suse.de/d/WDopenqaworker18/worker-dashboard-openqaworker18?orgId=1&from=1717972153700&to=1717979051591&viewPanel=65102

Unfortunately we don't have a similar panel for OSD itself. I think we had this issue in the past already and we just didn't notice. For sure room for improvement but I doubt it explains why OSD was unreachable for ~10 minutes.

These are the hosts which fail to connect:

openqa:~ # cat /tmp/failed_auth_all | cut -d \" -f 4 | sort | uniq | xargs -I{} dig -x {} +short
openqaworker14.qa.suse.cz.
openqaworker16.qa.suse.cz.
openqaworker17.qa.suse.cz.
openqaworker18.qa.suse.cz.
openqa.oqa.prg2.suse.org.
worker-arm2.oqa.prg2.suse.org.
worker-arm1.oqa.prg2.suse.org.
worker32.oqa.prg2.suse.org.
worker30.oqa.prg2.suse.org.
worker34.oqa.prg2.suse.org.
worker33.oqa.prg2.suse.org.
worker31.oqa.prg2.suse.org.
worker29.oqa.prg2.suse.org.
worker35.oqa.prg2.suse.org.
worker40.oqa.prg2.suse.org.
openqa-service.qe.prg2.suse.org.
Actions #3

Updated by okurz 6 months ago

  • Subject changed from No HTTP Response on OSD on 10-06-2024 to No HTTP Response on OSD on 10-06-2024 - "timestamp mismatch"
  • Priority changed from High to Immediate
Actions #4

Updated by nicksinger 6 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger
Actions #5

Updated by nicksinger 6 months ago

  • Subject changed from No HTTP Response on OSD on 10-06-2024 - "timestamp mismatch" to No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry
  • Priority changed from Immediate to Urgent

Used dry_run=1 host=openqa.suse.de result="result='incomplete'" additional_filters="reason like '%mismatch%'" comment="label:poo162038" ./openqa-advanced-retrigger-jobs and restarted 17 tests:

{"result":[{"14560669":14567718}],"test_url":[{"14560669":"\/tests\/14567718"}]}
{"result":[{"14560744":14567719}],"test_url":[{"14560744":"\/tests\/14567719"}]}
{"result":[{"14560727":14567720}],"test_url":[{"14560727":"\/tests\/14567720"}]}
{"result":[{"14563663":14567721,"14563664":14567722}],"test_url":[{"14563663":"\/tests\/14567721","14563664":"\/tests\/14567722"}]}
{"result":[{"14563629":14567723}],"test_url":[{"14563629":"\/tests\/14567723"}]}
{"result":[{"14563623":14567724,"14563624":14567725,"14563625":14567726,"14563626":14567727}],"test_url":[{"14563623":"\/tests\/14567724","14563624":"\/tests\/14567725","14563625":"\/tests\/14567726","14563626":"\/tests\/14567727"}]}
{"result":[{"14563641":14567728,"14563642":14567729,"14563643":14567730,"14563644":14567731}],"test_url":[{"14563641":"\/tests\/14567728","14563642":"\/tests\/14567729","14563643":"\/tests\/14567730","14563644":"\/tests\/14567731"}]}
{"result":[{"14563665":14567732}],"test_url":[{"14563665":"\/tests\/14567732"}]}
{"result":[{"14563666":14567733}],"test_url":[{"14563666":"\/tests\/14567733"}]}
{"result":[{"14563672":14567734}],"test_url":[{"14563672":"\/tests\/14567734"}]}
{"result":[{"14560652":14567735}],"test_url":[{"14560652":"\/tests\/14567735"}]}
{"result":[{"14563484":14567736}],"test_url":[{"14563484":"\/tests\/14567736"}]}
{"result":[{"14563528":14567737}],"test_url":[{"14563528":"\/tests\/14567737"}]}
{"result":[{"14560837":14567738}],"test_url":[{"14560837":"\/tests\/14567738"}]}
{"result":[{"14560756":14567739}],"test_url":[{"14560756":"\/tests\/14567739"}]}
{"result":[{"14563529":14567740}],"test_url":[{"14563529":"\/tests\/14567740"}]}
{"enforceable":1,"errors":["Job 14561378 misses the following mandatory assets: hdd\/publiccloud_5.5_GCE-BYOS-Updates-x86_64.qcow2\nYou may try to retrigger the parent job that should create the assets and will implicitly retrigger this job as well."],"result":[],"test_url":[]}

also added an auto_review expression to the title to mitigate further issues

Actions #6

Updated by openqa_review 6 months ago

  • Due date set to 2024-06-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by nicksinger 6 months ago

  • Status changed from In Progress to Resolved

I checked our DB again with sudo -u geekotest psql --no-align --tuples-only --command="select id from jobs where (result='incomplete' and t_finished >= '2024-06-11' and reason like '%mismatch%');" openqa and can't find a single job failing. From an infra POV we currently can't come up with a solution or workaround. We think that the deployment in the middle of the day (Jun 10 10:27) caused people to find this issue more easily then before.

As I don't plan to change or improve anything here (except making deployment more stable in other ticket), I will close this now. When this becomes a bigger problem we might have to think about bigger architecture changes (e.g. get rid of the strict time dependency, make sure the updates don't slow down everything so clocks get out of sync, etc.)

Actions #8

Updated by okurz 5 months ago

  • Related to action #162644: Job fails with api failure: 403 response: timestamp mismatch after brief osd outage added
Actions #9

Updated by okurz 5 months ago · Edited

  • Due date deleted (2024-06-26)
  • Status changed from Resolved to New
  • Assignee deleted (nicksinger)

Happened again. Merging #162644 with this one as there is more information here:
openQA test in scenario sle-15-SP5-Online-QR-x86_64-create_hdd_minimal_base+sdk@64bit fails in
welcome
The test failed with this reason: api failure: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync

There were also reports of osd being inaccessible in https://suse.slack.com/archives/C02CANHLANP/p1718891972135439 around that time.
There was a ton of Updating seen of worker and influxdb/telegraf timing out sudo journalctl -S '2024-06-20 15:50:00' -U '2024-06-20 16:00:00'
The outage is also seen on grafana: https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1718889892165&to=1718893680225

Actions #10

Updated by okurz 5 months ago

  • Related to action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:M added
Actions #11

Updated by okurz 5 months ago · Edited

  • Status changed from New to Blocked
  • Assignee set to okurz
  • Priority changed from Urgent to Normal

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1718890439196&to=1718894021805 from the same timeframe shows the same symptoms as from #159396 and https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1718890439196&to=1718894021805&viewPanel=179 shows a very active pg_dump after the period of unresponsiveness. We need a proper solution in #159396 first.

Done for i in failed incomplete parallel_failed; do host=openqa.suse.de result="result='$i'" additional_filters="reason like '%mismatch%'" comment="label:poo162038" ./openqa-advanced-retrigger-jobs; done

Actions #12

Updated by okurz 5 months ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)
  • Priority changed from Normal to High

#159396 was resolved. The symptom "timestamp mismatch" appeared again today, see https://suse.slack.com/archives/C02CANHLANP/p1720581833982129

Julie Cao Fail to clone test on OSD with Reason: api failure: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync, https://openqa.suse.de/tests/14873863#

Actions #13

Updated by livdywan 5 months ago

  • Related to action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #14

Updated by livdywan 5 months ago

This may have been caused by the issue suspected to underly #163592 since jobs ended up waiting long enough to cause the time to be off.

Actions #15

Updated by livdywan 5 months ago

  • Status changed from New to In Progress

I'm currently taking a closer look at what's happening, and to see what may be causing this separate from #163592.

Actions #16

Updated by livdywan 5 months ago

It's worth noting the following jobs hit this error without being picked up and restarted despite matching the regex:

None of the jobs seems to have worker or os-autoinst logs but that was also the case with previous occurrences.

Actions #17

Updated by livdywan 5 months ago

dry_run=1 host=openqa.suse.de result="result='incomplete'" additional_filters="reason like '%mismatch%'" comment="label:poo162038" ./openqa-advanced-retrigger-jobs

Reveals multiple affected jobs which are being restarted now.

Actions #18

Updated by livdywan 5 months ago · Edited

The most interesting event I could spot in sudo journalctl -S '2024-07-11 10:00:00' -U '2024-07-11 11:00:00 is this:

Jul 11 10:33:12 openqa systemd[1]: Started Session c67255 of User dehydrated

Several systemd messages about shutting down and restarting sessions right before those jobs failed.

Actions #19

Updated by livdywan 5 months ago · Edited

Apparently we're seeing more incompletes with different variants of "api error", and the journal contains multiple instances of this:

Jul 11 16:25:27 openqa wickedd[912]: restarted rtnetlink event listener                                                                                               
Jul 11 16:25:27 openqa wickedd[912]: rtnetlink event receive error: Out of memory (No buffer space available)
Actions #20

Updated by nicksinger 5 months ago

Some NTP/chrony-values from all machines as reference:

openqa:~ # chronyc tracking
Reference ID    : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
Stratum         : 3
Ref time (UTC)  : Thu Jul 11 13:57:27 2024
System time     : 0.000022083 seconds fast of NTP time
Last offset     : +0.000024303 seconds
RMS offset      : 0.000066092 seconds
Frequency       : 27.581 ppm slow
Residual freq   : +0.001 ppm
Skew            : 0.039 ppm
Root delay      : 0.001308891 seconds
Root dispersion : 0.004389539 seconds
Update interval : 1026.1 seconds
Leap status     : Normal
openqa:~ # salt --no-color \* cmd.run 'chronyc tracking'
openqa.suse.de:
    Reference ID    : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 13:57:27 2024
    System time     : 0.000021778 seconds fast of NTP time
    Last offset     : +0.000024303 seconds
    RMS offset      : 0.000066092 seconds
    Frequency       : 27.581 ppm slow
    Residual freq   : +0.001 ppm
    Skew            : 0.039 ppm
    Root delay      : 0.001308891 seconds
    Root dispersion : 0.004403387 seconds
    Update interval : 1026.1 seconds
    Leap status     : Normal
qamaster.qe.nue2.suse.org:
    Reference ID    : C3C98961 (defiant.tlercher.de)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:43:15 2024
    System time     : 0.000019144 seconds slow of NTP time
    Last offset     : -0.000010651 seconds
    RMS offset      : 0.000040153 seconds
    Frequency       : 25.721 ppm fast
    Residual freq   : +0.010 ppm
    Skew            : 0.066 ppm
    Root delay      : 0.004941278 seconds
    Root dispersion : 0.000637016 seconds
    Update interval : 64.2 seconds
    Leap status     : Normal
qesapworker-prg4.qa.suse.cz:
    Reference ID    : 4E6C60C5 (vip197.czela.net)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:35:23 2024
    System time     : 0.000043590 seconds slow of NTP time
    Last offset     : -0.000018665 seconds
    RMS offset      : 0.000030183 seconds
    Frequency       : 12.469 ppm slow
    Residual freq   : +0.000 ppm
    Skew            : 0.011 ppm
    Root delay      : 0.002343056 seconds
    Root dispersion : 0.001813403 seconds
    Update interval : 513.1 seconds
    Leap status     : Normal
backup-vm.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:24 2024
    System time     : 0.000067627 seconds slow of NTP time
    Last offset     : +0.000029119 seconds
    RMS offset      : 0.000088596 seconds
    Frequency       : 25.691 ppm fast
    Residual freq   : +0.000 ppm
    Skew            : 0.024 ppm
    Root delay      : 0.005426399 seconds
    Root dispersion : 0.002315616 seconds
    Update interval : 1035.6 seconds
    Leap status     : Normal
openqaworker1.qe.nue2.suse.org:
    Reference ID    : D5EFEA1C (server1b.meinberg.de)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:01 2024
    System time     : 0.000018288 seconds slow of NTP time
    Last offset     : -0.000010778 seconds
    RMS offset      : 0.000012211 seconds
    Frequency       : 55.573 ppm slow
    Residual freq   : -0.007 ppm
    Skew            : 0.062 ppm
    Root delay      : 0.004088305 seconds
    Root dispersion : 0.000172256 seconds
    Update interval : 130.0 seconds
    Leap status     : Normal
tumblesle.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:30:25 2024
    System time     : 0.000177187 seconds slow of NTP time
    Last offset     : -0.000153939 seconds
    RMS offset      : 0.000094980 seconds
    Frequency       : 25.823 ppm fast
    Residual freq   : -0.003 ppm
    Skew            : 0.024 ppm
    Root delay      : 0.005508567 seconds
    Root dispersion : 0.002260174 seconds
    Update interval : 1039.8 seconds
    Leap status     : Normal
storage.qe.prg2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:16 2024
    System time     : 0.000027968 seconds slow of NTP time
    Last offset     : -0.000006329 seconds
    RMS offset      : 0.000016318 seconds
    Frequency       : 38.513 ppm slow
    Residual freq   : -0.000 ppm
    Skew            : 0.009 ppm
    Root delay      : 0.001057816 seconds
    Root dispersion : 0.001494096 seconds
    Update interval : 516.7 seconds
    Leap status     : Normal
qesapworker-prg5.qa.suse.cz:
    Reference ID    : 4E6C60C5 (vip197.czela.net)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:38:03 2024
    System time     : 0.000044509 seconds fast of NTP time
    Last offset     : +0.000044949 seconds
    RMS offset      : 0.000075088 seconds
    Frequency       : 15.335 ppm slow
    Residual freq   : +0.001 ppm
    Skew            : 0.010 ppm
    Root delay      : 0.002453622 seconds
    Root dispersion : 0.001443175 seconds
    Update interval : 1036.7 seconds
    Leap status     : Normal
imagetester.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:29:53 2024
    System time     : 0.000046080 seconds fast of NTP time
    Last offset     : -0.000012911 seconds
    RMS offset      : 0.000062851 seconds
    Frequency       : 66.388 ppm fast
    Residual freq   : -0.000 ppm
    Skew            : 0.017 ppm
    Root delay      : 0.005378101 seconds
    Root dispersion : 0.002174454 seconds
    Update interval : 1037.0 seconds
    Leap status     : Normal
jenkins.qe.nue2.suse.org:
    Reference ID    : 83BC03DD (ntp1.rrze.uni-erlangen.de)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:29:32 2024
    System time     : 0.000091726 seconds slow of NTP time
    Last offset     : +0.000001189 seconds
    RMS offset      : 0.000068897 seconds
    Frequency       : 25.711 ppm fast
    Residual freq   : +0.001 ppm
    Skew            : 0.066 ppm
    Root delay      : 0.006706720 seconds
    Root dispersion : 0.002170046 seconds
    Update interval : 514.7 seconds
    Leap status     : Normal
openqaw5-xen.qe.prg2.suse.org:
    Reference ID    : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:41:14 2024
    System time     : 0.000026322 seconds fast of NTP time
    Last offset     : +0.000008645 seconds
    RMS offset      : 0.000013326 seconds
    Frequency       : 19.957 ppm slow
    Residual freq   : +0.000 ppm
    Skew            : 0.009 ppm
    Root delay      : 0.001055759 seconds
    Root dispersion : 0.001510400 seconds
    Update interval : 258.2 seconds
    Leap status     : Normal
baremetal-support.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:39:59 2024
    System time     : 0.000289661 seconds slow of NTP time
    Last offset     : -0.000209298 seconds
    RMS offset      : 0.000106703 seconds
    Frequency       : 25.723 ppm fast
    Residual freq   : -0.017 ppm
    Skew            : 0.075 ppm
    Root delay      : 0.005447044 seconds
    Root dispersion : 0.002248233 seconds
    Update interval : 1032.8 seconds
    Leap status     : Normal
diesel.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:37:41 2024
    System time     : 0.000083468 seconds slow of NTP time
    Last offset     : +0.000029892 seconds
    RMS offset      : 0.000053862 seconds
    Frequency       : 12.831 ppm slow
    Residual freq   : +0.000 ppm
    Skew            : 0.014 ppm
    Root delay      : 0.005361831 seconds
    Root dispersion : 0.001741083 seconds
    Update interval : 514.3 seconds
    Leap status     : Normal
openqaworker16.qa.suse.cz:
    Reference ID    : 4E50AC85 (78.80.172.133)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:36:10 2024
    System time     : 0.000095464 seconds fast of NTP time
    Last offset     : +0.000010597 seconds
    RMS offset      : 0.000050793 seconds
    Frequency       : 30.884 ppm slow
    Residual freq   : +0.003 ppm
    Skew            : 0.042 ppm
    Root delay      : 0.003768685 seconds
    Root dispersion : 0.000559975 seconds
    Update interval : 258.1 seconds
    Leap status     : Normal
worker35.oqa.prg2.suse.org:
    Reference ID    : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:45 2024
    System time     : 0.000008578 seconds slow of NTP time
    Last offset     : -0.000014809 seconds
    RMS offset      : 0.000054246 seconds
    Frequency       : 12.161 ppm slow
    Residual freq   : -0.022 ppm
    Skew            : 0.152 ppm
    Root delay      : 0.001055162 seconds
    Root dispersion : 0.001556702 seconds
    Update interval : 64.3 seconds
    Leap status     : Normal
sapworker1.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:41 2024
    System time     : 0.000138152 seconds fast of NTP time
    Last offset     : +0.000006250 seconds
    RMS offset      : 0.000065712 seconds
    Frequency       : 2.265 ppm fast
    Residual freq   : +0.001 ppm
    Skew            : 0.066 ppm
    Root delay      : 0.005746688 seconds
    Root dispersion : 0.001819121 seconds
    Update interval : 258.2 seconds
    Leap status     : Normal
monitor.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:33:43 2024
    System time     : 0.000173209 seconds slow of NTP time
    Last offset     : -0.000073582 seconds
    RMS offset      : 0.000067978 seconds
    Frequency       : 25.700 ppm fast
    Residual freq   : -0.001 ppm
    Skew            : 0.031 ppm
    Root delay      : 0.005451505 seconds
    Root dispersion : 0.002243097 seconds
    Update interval : 1041.1 seconds
    Leap status     : Normal
worker32.oqa.prg2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:59 2024
    System time     : 0.000022194 seconds fast of NTP time
    Last offset     : +0.000006747 seconds
    RMS offset      : 0.000010521 seconds
    Frequency       : 20.574 ppm slow
    Residual freq   : +0.000 ppm
    Skew            : 0.017 ppm
    Root delay      : 0.001044271 seconds
    Root dispersion : 0.002068724 seconds
    Update interval : 64.7 seconds
    Leap status     : Normal
openqaworker18.qa.suse.cz:
    Reference ID    : D5C036E8 (213.192.54.232)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:41:46 2024
    System time     : 0.000052862 seconds slow of NTP time
    Last offset     : -0.000002611 seconds
    RMS offset      : 0.000075304 seconds
    Frequency       : 36.898 ppm slow
    Residual freq   : -0.001 ppm
    Skew            : 0.042 ppm
    Root delay      : 0.002779522 seconds
    Root dispersion : 0.000671763 seconds
    Update interval : 260.3 seconds
    Leap status     : Normal
sapworker3.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:27:09 2024
    System time     : 0.000080542 seconds fast of NTP time
    Last offset     : +0.000005551 seconds
    RMS offset      : 0.000063281 seconds
    Frequency       : 3.990 ppm fast
    Residual freq   : +0.001 ppm
    Skew            : 0.024 ppm
    Root delay      : 0.005348192 seconds
    Root dispersion : 0.002241552 seconds
    Update interval : 1031.0 seconds
    Leap status     : Normal
mania.qe.nue2.suse.org:
    Reference ID    : C03C5811 (ntp2.m-online.net)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:43:14 2024
    System time     : 0.000005734 seconds slow of NTP time
    Last offset     : -0.000007869 seconds
    RMS offset      : 0.000026844 seconds
    Frequency       : 23.346 ppm slow
    Residual freq   : -0.001 ppm
    Skew            : 0.027 ppm
    Root delay      : 0.005080119 seconds
    Root dispersion : 0.000471100 seconds
    Update interval : 259.6 seconds
    Leap status     : Normal
openqaworker17.qa.suse.cz:
    Reference ID    : 0501387B (ntp.suas.cz)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:36:29 2024
    System time     : 0.000211145 seconds slow of NTP time
    Last offset     : -0.000442668 seconds
    RMS offset      : 0.000230532 seconds
    Frequency       : 38.706 ppm slow
    Residual freq   : -0.032 ppm
    Skew            : 0.277 ppm
    Root delay      : 0.006446593 seconds
    Root dispersion : 0.000680730 seconds
    Update interval : 1041.3 seconds
    Leap status     : Normal
schort-server.qe.nue2.suse.org:
    Reference ID    : C3C98961 (defiant.tlercher.de)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:43:19 2024
    System time     : 0.000065558 seconds fast of NTP time
    Last offset     : +0.000026393 seconds
    RMS offset      : 0.000062144 seconds
    Frequency       : 25.702 ppm fast
    Residual freq   : -0.000 ppm
    Skew            : 0.020 ppm
    Root delay      : 0.005218701 seconds
    Root dispersion : 0.001101224 seconds
    Update interval : 130.0 seconds
    Leap status     : Normal
sapworker2.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:01 2024
    System time     : 0.000100719 seconds fast of NTP time
    Last offset     : +0.000037721 seconds
    RMS offset      : 0.000043879 seconds
    Frequency       : 0.090 ppm fast
    Residual freq   : +0.001 ppm
    Skew            : 0.027 ppm
    Root delay      : 0.005470708 seconds
    Root dispersion : 0.001831063 seconds
    Update interval : 512.1 seconds
    Leap status     : Normal
petrol.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:26:40 2024
    System time     : 0.000170613 seconds slow of NTP time
    Last offset     : -0.000278480 seconds
    RMS offset      : 0.000129630 seconds
    Frequency       : 18.311 ppm slow
    Residual freq   : -0.010 ppm
    Skew            : 0.070 ppm
    Root delay      : 0.005347948 seconds
    Root dispersion : 0.003424126 seconds
    Update interval : 1043.3 seconds
    Leap status     : Normal
ada.qe.prg2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:36:27 2024
    System time     : 0.000016693 seconds fast of NTP time
    Last offset     : -0.000013316 seconds
    RMS offset      : 0.000014862 seconds
    Frequency       : 16.582 ppm fast
    Residual freq   : -0.001 ppm
    Skew            : 0.011 ppm
    Root delay      : 0.001102176 seconds
    Root dispersion : 0.001735037 seconds
    Update interval : 513.4 seconds
    Leap status     : Normal
worker-arm1.oqa.prg2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:26:24 2024
    System time     : 0.000007647 seconds fast of NTP time
    Last offset     : +0.000080160 seconds
    RMS offset      : 0.000034377 seconds
    Frequency       : 5.537 ppm slow
    Residual freq   : +0.005 ppm
    Skew            : 0.024 ppm
    Root delay      : 0.001136960 seconds
    Root dispersion : 0.002377458 seconds
    Update interval : 517.0 seconds
    Leap status     : Normal
openqaworker14.qa.suse.cz:
    Reference ID    : 4E6C60C5 (vip197.czela.net)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:36:43 2024
    System time     : 0.000310024 seconds slow of NTP time
    Last offset     : -0.000359515 seconds
    RMS offset      : 0.000177298 seconds
    Frequency       : 7.105 ppm slow
    Residual freq   : -0.016 ppm
    Skew            : 0.115 ppm
    Root delay      : 0.002246775 seconds
    Root dispersion : 0.002097186 seconds
    Update interval : 2077.0 seconds
    Leap status     : Normal
osiris-1.qe.nue2.suse.org:
    Reference ID    : 4F852C8E (79.133.44.142)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:40:34 2024
    System time     : 0.000006179 seconds slow of NTP time
    Last offset     : -0.000007087 seconds
    RMS offset      : 0.000014982 seconds
    Frequency       : 11.098 ppm fast
    Residual freq   : -0.000 ppm
    Skew            : 0.018 ppm
    Root delay      : 0.003305381 seconds
    Root dispersion : 0.000208739 seconds
    Update interval : 259.5 seconds
    Leap status     : Normal
openqaworker-arm-1.qe.nue2.suse.org:
    Reference ID    : BF2D4343 (server1b.meinberg.de)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:41:28 2024
    System time     : 0.000093359 seconds slow of NTP time
    Last offset     : -0.000007283 seconds
    RMS offset      : 0.000033947 seconds
    Frequency       : 24.445 ppm slow
    Residual freq   : -0.001 ppm
    Skew            : 0.024 ppm
    Root delay      : 0.004061557 seconds
    Root dispersion : 0.000189946 seconds
    Update interval : 515.2 seconds
    Leap status     : Normal
backup-qam.qe.nue2.suse.org:
    Reference ID    : C3C98961 (defiant.tlercher.de)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:43:24 2024
    System time     : 0.000006009 seconds fast of NTP time
    Last offset     : +0.000001472 seconds
    RMS offset      : 0.000033058 seconds
    Frequency       : 19.443 ppm fast
    Residual freq   : -0.000 ppm
    Skew            : 0.055 ppm
    Root delay      : 0.005161884 seconds
    Root dispersion : 0.000541619 seconds
    Update interval : 65.0 seconds
    Leap status     : Normal
openqa-piworker.qe.nue2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:40:42 2024
    System time     : 0.000003187 seconds fast of NTP time
    Last offset     : -0.000018070 seconds
    RMS offset      : 0.000012431 seconds
    Frequency       : 9.311 ppm fast
    Residual freq   : -0.010 ppm
    Skew            : 0.059 ppm
    Root delay      : 0.005354426 seconds
    Root dispersion : 0.001506209 seconds
    Update interval : 258.0 seconds
    Leap status     : Normal
unreal6.qe.nue2.suse.org:
    Reference ID    : D5AC696A (ns1.blazing.de)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:30:00 2024
    System time     : 0.000075821 seconds fast of NTP time
    Last offset     : +0.000169504 seconds
    RMS offset      : 0.000227368 seconds
    Frequency       : 27.267 ppm fast
    Residual freq   : +0.055 ppm
    Skew            : 0.156 ppm
    Root delay      : 0.003834251 seconds
    Root dispersion : 0.002034988 seconds
    Update interval : 1029.7 seconds
    Leap status     : Normal
qesapworker-prg6.qa.suse.cz:
    Reference ID    : 4E6C60C5 (vip197.czela.net)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:34:48 2024
    System time     : 0.000066652 seconds fast of NTP time
    Last offset     : +0.000069314 seconds
    RMS offset      : 0.000084554 seconds
    Frequency       : 7.334 ppm slow
    Residual freq   : +0.004 ppm
    Skew            : 0.018 ppm
    Root delay      : 0.002232490 seconds
    Root dispersion : 0.001841876 seconds
    Update interval : 2080.7 seconds
    Leap status     : Normal
worker29.oqa.prg2.suse.org:
    Reference ID    : AEA9312C (2a07:de40:b205:26:10:144:55:130)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:43:23 2024
    System time     : 0.000015936 seconds fast of NTP time
    Last offset     : +0.000011206 seconds
    RMS offset      : 0.000020762 seconds
    Frequency       : 11.054 ppm slow
    Residual freq   : +0.002 ppm
    Skew            : 0.061 ppm
    Root delay      : 0.003176525 seconds
    Root dispersion : 0.000410482 seconds
    Update interval : 65.1 seconds
    Leap status     : Normal
qesapworker-prg7.qa.suse.cz:
    Reference ID    : 4E6C60C5 (vip197.czela.net)
    Stratum         : 2
    Ref time (UTC)  : Thu Jul 11 14:36:00 2024
    System time     : 0.000053961 seconds fast of NTP time
    Last offset     : +0.000056638 seconds
    RMS offset      : 0.000071415 seconds
    Frequency       : 5.603 ppm slow
    Residual freq   : +0.000 ppm
    Skew            : 0.009 ppm
    Root delay      : 0.002314058 seconds
    Root dispersion : 0.001585285 seconds
    Update interval : 2074.7 seconds
    Leap status     : Normal
grenache-1.oqa.prg2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:30:33 2024
    System time     : 0.000051045 seconds fast of NTP time
    Last offset     : +0.000001856 seconds
    RMS offset      : 0.000103933 seconds
    Frequency       : 7.286 ppm slow
    Residual freq   : -0.000 ppm
    Skew            : 0.029 ppm
    Root delay      : 0.001180847 seconds
    Root dispersion : 0.002281780 seconds
    Update interval : 1034.7 seconds
    Leap status     : Normal
worker-arm2.oqa.prg2.suse.org:
    Reference ID    : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:39:56 2024
    System time     : 0.000105552 seconds slow of NTP time
    Last offset     : -0.000024404 seconds
    RMS offset      : 0.000042628 seconds
    Frequency       : 7.849 ppm slow
    Residual freq   : -0.001 ppm
    Skew            : 0.013 ppm
    Root delay      : 0.001032133 seconds
    Root dispersion : 0.001510956 seconds
    Update interval : 256.7 seconds
    Leap status     : Normal
s390zl12.oqa.prg2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 13:53:31 2024
    System time     : 0.000012752 seconds slow of NTP time
    Last offset     : +0.000000963 seconds
    RMS offset      : 0.000022999 seconds
    Frequency       : 0.050 ppm fast
    Residual freq   : -0.000 ppm
    Skew            : 0.008 ppm
    Root delay      : 0.001468576 seconds
    Root dispersion : 0.004334040 seconds
    Update interval : 517.5 seconds
    Leap status     : Normal
worker33.oqa.prg2.suse.org:
    Reference ID    : 0A903781 (kronos01.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:38:59 2024
    System time     : 0.000001046 seconds fast of NTP time
    Last offset     : -0.000000740 seconds
    RMS offset      : 0.000018752 seconds
    Frequency       : 15.258 ppm slow
    Residual freq   : -0.000 ppm
    Skew            : 0.017 ppm
    Root delay      : 0.001051913 seconds
    Root dispersion : 0.001512931 seconds
    Update interval : 516.7 seconds
    Leap status     : Normal
worker34.oqa.prg2.suse.org:
    Reference ID    : 0A903782 (kronos02.prg2.suse.org)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:43:08 2024
    System time     : 0.000043618 seconds fast of NTP time
    Last offset     : +0.000027052 seconds
    RMS offset      : 0.000025032 seconds
    Frequency       : 7.000 ppm slow
    Residual freq   : +0.030 ppm
    Skew            : 0.169 ppm
    Root delay      : 0.003190971 seconds
    Root dispersion : 0.000613747 seconds
    Update interval : 130.4 seconds
    Leap status     : Normal
worker30.oqa.prg2.suse.org:
    Reference ID    : 3FDD6B1D (2a07:de40:b205:26:10:144:55:129)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:40:31 2024
    System time     : 0.000018638 seconds slow of NTP time
    Last offset     : -0.000069619 seconds
    RMS offset      : 0.000037803 seconds
    Frequency       : 14.052 ppm slow
    Residual freq   : -0.063 ppm
    Skew            : 0.194 ppm
    Root delay      : 0.001035445 seconds
    Root dispersion : 0.001595671 seconds
    Update interval : 130.4 seconds
    Leap status     : Normal
worker40.oqa.prg2.suse.org:
    Reference ID    : AEA9312C (2a07:de40:b205:26:10:144:55:130)
    Stratum         : 3
    Ref time (UTC)  : Thu Jul 11 14:42:35 2024
    System time     : 0.000030772 seconds fast of NTP time
    Last offset     : +0.000001950 seconds
    RMS offset      : 0.000130650 seconds
    Frequency       : 7.392 ppm slow
    Residual freq   : -0.001 ppm
    Skew            : 0.087 ppm
    Root delay      : 0.003208488 seconds
    Root dispersion : 0.000277519 seconds
    Update interval : 65.1 seconds
    Leap status     : Normal
s390zl13.oqa.prg2.suse.org:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240711144328940954
ERROR: Minions returned with non-zero exit code

The "System time" and the therefore reported NTP offset seems to be way too low to cause any problems. We usually saw api authentication issues with time skew in the range of multiple seconds.

Actions #21

Updated by livdywan 5 months ago

livdywan wrote in #note-17:

dry_run=1 host=openqa.suse.de result="result='incomplete'" additional_filters="reason like '%api failure%'" comment="label:poo162038" ./openqa-advanced-retrigger-jobs

Retriggering jobs with api failure to mitigate while the cause is still unclear.

Actions #22

Updated by livdywan 5 months ago

  • Has duplicate action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S added
Actions #23

Updated by livdywan 5 months ago

A few kernel jobs have failed during upload phase with quite non-descript reason: "api failure". As a result, there's no autoinst-log.txt nor worker-log.txt.
https://openqa.suse.de/tests/14897579
https://openqa.suse.de/tests/14897580
https://openqa.suse.de/tests/14895759

Actions #25

Updated by okurz 5 months ago

  • Assignee set to okurz
Actions #26

Updated by okurz 5 months ago

  • Assignee changed from okurz to livdywan
Actions #27

Updated by livdywan 5 months ago

  • Status changed from In Progress to New
  • Assignee deleted (livdywan)

MDoucha wrote in #note-24:

More examples of unspecified "api failure":

Thank you for these examples. The last two especially seem to confirm it's related to uploading of assets even when the job is otherwise running fine:

https://openqa.suse.de/tests/14895760 - with logs

autoinst-log.txt

[2024-07-11T17:03:20.278926+02:00] [debug] [pid:8428] backend process exited: 0
Use of uninitialized value $name in pattern match (m//) at /usr/lib/os-autoinst/OpenQA/Isotovideo/Utils.pm line 332.
    OpenQA::Isotovideo::Utils::handle_generated_assets(OpenQA::Isotovideo::CommandHandler=HASH(0x55c503301a38), 1) called at /usr/bin/isotovideo line 152
    main::handle_shutdown() called at /usr/bin/isotovideo line 187
    eval {...} called at /usr/bin/isotovideo line 176
[2024-07-11T17:03:20.383373+02:00] [debug] [pid:8428] stopping backend process 8637
[...]
[2024-07-11T17:07:04.861866+02:00] [info] [pid:27937] Uploading textinfo-logs.tar.bz2
[2024-07-11T17:15:21.170344+02:00] [error] [pid:27937] All 60 upload attempts have failed for textinfo-logs.tar.bz2
[2024-07-11T17:15:21.170616+02:00] [error] [pid:27937] Error uploading textinfo-logs.tar.bz2: 403 response: Forbidden

worker-log.txt

[2024-07-11T17:15:12.171535+02:00] [debug] [pid:64848] Upload concluded (no current module)
[2024-07-11T17:15:21.169877+02:00] [warn] [pid:27937] Error uploading textinfo-logs.tar.bz2: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
[2024-07-11T17:15:21.170437+02:00] [error] [pid:27937] All 60 upload attempts have failed for textinfo-logs.tar.bz2
[2024-07-11T17:15:21.170670+02:00] [error] [pid:27937] Error uploading textinfo-logs.tar.bz2: 403 response: Forbidden
[2024-07-11T17:15:21.191433+02:00] [info] [pid:27937] Uploading video.webm

https://openqa.suse.de/tests/14896597 - also with logs

autoinst-log.txt

[2024-07-11T17:19:06.909676+02:00] [info] [pid:30134] Result: done
[2024-07-11T17:19:06.914566+02:00] [info] [pid:44561] Uploading patch_and_reboot-kernel_changelog.log
[2024-07-11T17:28:00.193604+02:00] [error] [pid:44561] All 60 upload attempts have failed for patch_and_reboot-kernel_changelog.log
[2024-07-11T17:28:00.193858+02:00] [error] [pid:44561] Error uploading patch_and_reboot-kernel_changelog.log: 403 response: Forbidden
[2024-07-11T17:28:00.201425+02:00] [info] [pid:44561] Uploading video.webm

worker-log.txt

[2024-07-11T17:27:52.117556+02:00] [debug] [pid:30134] REST-API call: POST http://openqa.suse.de/api/v1/jobs/14896597/status
[2024-07-11T17:27:52.653861+02:00] [debug] [pid:30134] Upload concluded (no current module)
[2024-07-11T17:28:00.193340+02:00] [warn] [pid:44561] Error uploading patch_and_reboot-kernel_changelog.log: 403 response: timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
[2024-07-11T17:28:00.193681+02:00] [error] [pid:44561] All 60 upload attempts have failed for patch_and_reboot-kernel_changelog.log
[2024-07-11T17:28:00.193921+02:00] [error] [pid:44561] Error uploading patch_and_reboot-kernel_changelog.log: 403 response: Forbidden
[2024-07-11T17:28:00.201382+02:00] [info] [pid:44561] Uploading video.webm

I suggest to estimate this ticket and decide how to best handle these.

Actions #28

Updated by okurz 5 months ago

  • Subject changed from No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry to No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #29

Updated by okurz 5 months ago

  • Priority changed from High to Urgent
Actions #31

Updated by mkittler 5 months ago

Also see #163592#note-34 for an update and how to query for relevant jobs. I have also restarted relevant jobs as part of this ticket.

Actions #32

Updated by nicksinger 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to nicksinger
Actions #33

Updated by openqa_review 5 months ago

  • Due date set to 2024-07-30

Setting due date based on mean cycle time of SUSE QE Tools

Actions #34

Updated by nicksinger 5 months ago

  • Priority changed from Urgent to High
openqa=> select count(id), min(t_finished) as first_t_finished, max(t_finished) as last_t_finished, reason from jobs where t_finished >= '2024-07-16T08:00:00' and result = 'incomplete' group by reason order by count(id) desc;
 count |  first_t_finished   |   last_t_finished   |                                                                                                                                                    reason
-------+---------------------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    25 | 2024-07-16 08:14:51 | 2024-07-17 03:00:12 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5901>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
    23 | 2024-07-16 08:48:03 | 2024-07-16 15:38:38 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5907>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
    23 | 2024-07-16 08:42:46 | 2024-07-16 15:55:46 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
    20 | 2024-07-16 10:56:23 | 2024-07-17 04:39:37 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
    18 | 2024-07-16 18:35:12 | 2024-07-16 18:56:12 | asset failure: Failed to download SLE-15-SP4-Full-ppc64le-Build190.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-ppc64le-Build190.2-Media1.iso.sha256
    16 | 2024-07-16 17:49:27 | 2024-07-17 02:01:31 | tests died: unable to load main.pm, check the log for the cause (e.g. syntax error)
    13 | 2024-07-16 17:02:18 | 2024-07-17 02:19:48 | backend died: QEMU exited unexpectedly, see log for details
    10 | 2024-07-16 11:00:58 | 2024-07-17 07:17:11 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5903>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
    10 | 2024-07-16 17:12:07 | 2024-07-16 17:48:14 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build207.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build207.1-Media1.iso
    10 | 2024-07-16 17:19:03 | 2024-07-16 18:13:39 | asset failure: Failed to download SLE-15-SP4-Full-x86_64-Build208.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-x86_64-Build208.1-Media1.iso
     9 | 2024-07-16 08:06:23 | 2024-07-17 05:02:10 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5902>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
     8 | 2024-07-16 17:58:39 | 2024-07-16 17:59:35 | asset failure: Failed to download SLE-15-SP4-Full-aarch64-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-aarch64-Build208.1-Media1.iso.sha256
     8 | 2024-07-16 17:49:19 | 2024-07-16 17:52:40 | asset failure: Failed to download SLE-15-SP4-Online-aarch64-Build207.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-aarch64-Build207.1-Media1.iso
     8 | 2024-07-16 09:14:22 | 2024-07-16 09:46:59 | quit: worker has been stopped or restarted
     8 | 2024-07-16 17:49:00 | 2024-07-16 18:00:12 | asset failure: Failed to download SLE-15-SP4-Full-s390x-Build208.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-s390x-Build208.1-Media1.iso
     7 | 2024-07-16 08:08:31 | 2024-07-16 10:44:56 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5904>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
     6 | 2024-07-17 02:30:00 | 2024-07-17 02:46:23 | tests died: unable to load tests/console/validate_raid_encrypt_home.pm, check the log for the cause (e.g. syntax error)
     5 | 2024-07-16 12:44:27 | 2024-07-17 08:29:55 | backend died: QEMU terminated before QMP connection could be established. Check for errors below
     4 | 2024-07-17 06:31:22 | 2024-07-17 07:19:18 | asset failure: Failed to download SLES-15-SP5-x86_64-create_hdd_yast_maintenance_desktop_deregistered.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-x86_64-create_hdd_yast_maintenance_desktop_deregistered.qcow2
     4 | 2024-07-16 17:11:22 | 2024-07-16 18:26:03 | asset failure: Failed to download SLE-15-SP4-Online-ppc64le-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-ppc64le-Build208.1-Media1.iso.sha256
     4 | 2024-07-16 17:39:36 | 2024-07-16 17:58:40 | asset failure: Failed to download SLE-15-SP4-Online-aarch64-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-aarch64-Build191.4-Media1.iso.sha256
     4 | 2024-07-16 17:02:01 | 2024-07-16 17:02:56 | asset failure: Failed to download sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2
     3 | 2024-07-16 17:01:39 | 2024-07-16 17:02:10 | asset failure: Failed to download sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2
     3 | 2024-07-16 09:53:10 | 2024-07-16 10:39:59 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5908>: IO::Socket::INET: connect: Connection refused [Auto-restarting because reason matches the configured "auto_clone_regex".]
     2 | 2024-07-16 23:22:35 | 2024-07-17 01:23:02 | asset failure: Failed to download SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256
     2 | 2024-07-16 17:38:41 | 2024-07-16 17:38:49 | asset failure: Failed to download SLE-15-SP4-Full-x86_64-Build195.1-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Full-x86_64-Build195.1-Media1.iso
     2 | 2024-07-16 17:48:09 | 2024-07-16 17:48:10 | asset failure: Failed to download SLE-15-SP4-Online-aarch64-Build190.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-aarch64-Build190.2-Media1.iso.sha256
     2 | 2024-07-16 15:08:08 | 2024-07-16 15:08:13 | asset failure: Failed to download SLE-12-SP5-x86_64-Build:34795:saptune-qam-sles4sap-gnome.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLE-12-SP5-x86_64-Build:34795:saptune-qam-sles4sap-gnome.qcow2
     2 | 2024-07-16 17:45:05 | 2024-07-16 17:45:07 | asset failure: Failed to download SLE-15-SP4-Online-s390x-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-s390x-Build191.4-Media1.iso.sha256
     2 | 2024-07-16 17:39:34 | 2024-07-16 17:39:42 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build190.2-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build190.2-Media1.iso.sha256
     2 | 2024-07-16 18:04:17 | 2024-07-16 19:22:34 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build191.4-Media1.iso to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build191.4-Media1.iso
     2 | 2024-07-16 17:34:55 | 2024-07-16 17:34:57 | asset failure: Failed to download SLE-15-SP4-Online-x86_64-Build195.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-x86_64-Build195.1-Media1.iso.sha256
     2 | 2024-07-16 10:13:02 | 2024-07-16 12:10:50 | asset failure: Failed to download SLES-15-SP3-x86_64-mru-install-minimal-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP3-x86_64-mru-install-minimal-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2
     2 | 2024-07-16 22:03:42 | 2024-07-16 22:05:13 | asset failure: Failed to download autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20240716-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-15-SP5-x86_64-create_hdd_yast_maintenance_minimal-Build20240716-1-Server-DVD-Updates-64bit.qcow2
     2 | 2024-07-16 17:43:07 | 2024-07-16 17:43:14 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit-amd.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit-amd.qcow2
     2 | 2024-07-16 17:43:44 | 2024-07-16 17:43:52 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-gnome_ay_qesec@64bit.qcow2
     2 | 2024-07-16 17:36:55 | 2024-07-16 17:36:56 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit-amd.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit-amd.qcow2
     2 | 2024-07-16 17:39:09 | 2024-07-16 17:39:12 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-textmode_ay_qesec@64bit.qcow2
     2 | 2024-07-16 18:07:48 | 2024-07-16 18:07:51 | asset failure: Failed to download sle-15-SP4-x86_64-191.4-textmode_ay_qesec@uefi.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-191.4-textmode_ay_qesec@uefi.qcow2
     2 | 2024-07-16 17:34:37 | 2024-07-16 17:34:40 | asset failure: Failed to download sle-15-SP4-x86_64-207.1-textmode_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-207.1-textmode_ay_qesec@64bit.qcow2
     2 | 2024-07-16 15:03:44 | 2024-07-16 16:12:30 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
     2 | 2024-07-17 06:10:10 | 2024-07-17 07:07:21 | tests died: unable to load tests/autoyast/installation.pm, check the log for the cause (e.g. syntax error)
     2 | 2024-07-16 21:31:54 | 2024-07-16 22:09:05 | tests died: unable to load tests/network/samba/samba_adcli.pm, check the log for the cause (e.g. syntax error)
     1 | 2024-07-16 14:48:03 | 2024-07-16 14:48:03 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5905>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
     1 | 2024-07-16 15:57:07 | 2024-07-16 15:57:07 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5907>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
     1 | 2024-07-16 12:39:26 | 2024-07-16 12:39:26 | backend died: Error connecting to VNC server <openqaw5-xen.qe.prg2.suse.org:5903>: IO::Socket::INET: connect: Connection refused [Not restarting job despite failure reason matching the configured "auto_clone_regex". It has already been restarted 20 times (limit is 20).]
     1 | 2024-07-16 17:01:53 | 2024-07-16 17:01:53 | asset failure: Failed to download sle-15-SP4-x86_64-208.7-textmode_ay_qesec@64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-textmode_ay_qesec@64bit.qcow2
     1 | 2024-07-16 17:02:22 | 2024-07-16 17:02:22 | asset failure: Failed to download sle-15-SP4-x86_64-208.7-gnome@64bit-fips.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-gnome@64bit-fips.qcow2
     1 | 2024-07-17 08:34:41 | 2024-07-17 08:34:41 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 30097. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 01:54:13 | 2024-07-17 01:54:13 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 30478. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 03:03:35 | 2024-07-17 03:03:35 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 30701. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 03:27:17 | 2024-07-17 03:27:17 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 31505. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 08:33:49 | 2024-07-17 08:33:49 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 32030. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 03:00:56 | 2024-07-17 03:00:56 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 32264. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 03:04:23 | 2024-07-17 03:04:23 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 34150. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 03:27:18 | 2024-07-17 03:27:18 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 34451. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 03:29:34 | 2024-07-17 03:29:34 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 34695. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 01:27:40 | 2024-07-17 01:27:40 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 35376. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 01:31:48 | 2024-07-17 01:31:48 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 35532. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 01:58:31 | 2024-07-17 01:58:31 | backend died: The file descriptor 13 (baseclass::cmdpipe) hit the read attempts threshold of 30000/30s by 36968. Active console 'root-virtio-terminal1' is not responding, it could be a half-open socket or you need to increase _CHKSEL_RATE_HITS value. Make sure the console is reachable or disable sta…
     1 | 2024-07-17 01:06:55 | 2024-07-17 01:06:55 | backend died: Virtio terminal and svirt serial terminal do not support send_key. Use
     1 | 2024-07-17 08:53:08 | 2024-07-17 08:53:08 | backend died: unexpected end of data at /usr/lib/os-autoinst/consoles/VNC.pm line 842.
     1 | 2024-07-16 23:24:07 | 2024-07-16 23:24:07 | cache failure: Failed to download SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-ppc64le-Build191.4-Media1.iso.sha256 [Auto-restarting because reason matches the configured "auto_clone_regex".]
     1 | 2024-07-16 18:09:47 | 2024-07-16 18:09:47 | cache failure: Failed to download SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256 [Auto-restarting because reason matches the configured "auto_clone_regex".]
     1 | 2024-07-16 17:01:38 | 2024-07-16 17:01:38 | cache failure: Failed to download sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-ppc64le-208.7-gnome_ay_qesec@ppc64le-2g.qcow2 [Auto-restarting because reason matches the configured "auto_clone_regex".]
     1 | 2024-07-16 17:02:33 | 2024-07-16 17:02:33 | cache failure: Failed to download sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2 to /var/lib/openqa/cache/openqa.suse.de/sle-15-SP4-x86_64-208.7-textmode@64bit-fips.qcow2 [Auto-restarting because reason matches the configured "auto_clone_regex".]
     1 | 2024-07-16 21:27:57 | 2024-07-16 21:27:57 | asset failure: Failed to download autoyast_SLES-12-SP5-x86_64-create_hdd_yast_maintenance_desktop-Build20240716-1.qcow2 to /var/lib/openqa/cache/openqa.suse.de/autoyast_SLES-12-SP5-x86_64-create_hdd_yast_maintenance_desktop-Build20240716-1.qcow2
     1 | 2024-07-17 07:42:07 | 2024-07-17 07:42:07 | asset failure: Failed to download SLES-15-SP5-x86_64-mru-install-desktop-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2 to /var/lib/openqa/cache/openqa.suse.de/SLES-15-SP5-x86_64-mru-install-desktop-with-addons_security-Build20240715-1-Server-DVD-Updates-64bit.qcow2
     1 | 2024-07-16 18:10:02 | 2024-07-16 18:10:02 | asset failure: Failed to download SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256 to /var/lib/openqa/cache/openqa.suse.de/SLE-15-SP4-Online-s390x-Build208.1-Media1.iso.sha256
     1 | 2024-07-17 01:13:57 | 2024-07-17 01:13:57 | backend died: Error connecting to VNC server <s390kvm097.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
     1 | 2024-07-17 01:07:21 | 2024-07-17 01:07:21 | backend died: Error connecting to VNC server <s390kvm096.oqa.prg2.suse.org:5901>: IO::Socket::INET: connect: No route to host
     1 | 2024-07-16 12:07:21 | 2024-07-16 12:07:21 | tests died: unable to load tests/console/perl_bootloader.pm, check the log for the cause (e.g. syntax error)

Quite some incomplete but not a single "api failure: 403 response: timestamp mismatch" (2 days ago: https://progress.opensuse.org/issues/163592#note-34) - so the auto-review works and restarts affected jobs. We can lower the prio then.

Actions #36

Updated by nicksinger 4 months ago

livdywan wrote in #note-35:

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1228 was merged

Thanks!

I had a talk yesterday with @kraih (https://suse.slack.com/archives/C02AJ1E568M/p1721229859258029) and we're actually not sure what will happen if we add this header a second time. I will now closely monitor OSD to see if we encounter problems

Actions #37

Updated by ybonatakis 4 months ago

I tried to restart nginx today and it fails. The logs took me to the conf file which points here.

Jul 19 07:34:35 openqa nginx[28217]: nginx: [emerg] unknown directive "perl_set" in /etc/nginx/nginx.conf:30                                                                                                                                 
Jul 19 07:34:35 openqa nginx[28217]: nginx: configuration file /etc/nginx/nginx.conf test failed                                                                                                                                             
Jul 19 07:34:35 openqa systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Actions #38

Updated by ybonatakis 4 months ago

Nick I manual revert the changes from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1228/diffs in order to restart the nginx and bring back the server. please take a look

Actions #40

Updated by nicksinger 4 months ago

ybonatakis wrote in #note-38:

Nick I manual revert the changes from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1228/diffs in order to restart the nginx and bring back the server. please take a look

Not sure why this was not working. nginx -V |& grep perl shows we have "--with-perl=/usr/bin/perl" but anyhow, might not help because it moves the timestamp generation into an earlier step so not pursuing this route further.

ph03nix wrote in #note-39:

The issue is back:

Thanks, but I have to stress: the issue was never gone, just less worse. Also all the linked jobs you pasted got covered by auto-restart so no manual action needed currently

Actions #41

Updated by nicksinger 4 months ago

  • Status changed from In Progress to Feedback

Created https://github.com/os-autoinst/openQA/pull/5782 which changes behavior and also adds some debug log message with both times in case a mismatch happens. I currently can't do much except moving slowly forward and observing the situation with my changes applied.

Actions #42

Updated by okurz 4 months ago

  • Copied to action #164296: openqa-label-known-issues does not look at known issues if autoinst-log.txt does not exist but reason could be looked at size:S added
Actions #43

Updated by okurz 4 months ago

  • Description updated (diff)
Actions #44

Updated by nicksinger 4 months ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1233 to workaround currently not restarted tests due to https://progress.opensuse.org/issues/164296 - upstream changes to openQA already hot patched on OSD. Waiting for some example outputs now to further pinpoint the issue.

Actions #45

Updated by nicksinger 4 months ago

  • Status changed from Feedback to In Progress

@okurz found https://openqa.suse.de/tests/14992170 - meaning our auto_restart_regex works as expected. Checking the logs of OSD I can see:

openqa:~ # journalctl --since=yesterday | grep "Timestamp mismatch over 300s"
Jul 23 10:30:22 openqa openqa[8091]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723065
Jul 23 10:30:22 openqa openqa[8091]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723065
[…]
Jul 23 10:30:23 openqa openqa[14892]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723120
Jul 23 10:30:23 openqa openqa[14892]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723120
Jul 23 10:30:23 openqa openqa[21151]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723119
Jul 23 10:30:23 openqa openqa[21151]: [debug] Timestamp mismatch over 300s; our_timestamp: 1721723422, X-API-Microtime (from worker): 1721723119

the time perfectly correlates with an experiment to run fstrim @okurz was conducting ( https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=1721722084983&to=1721724906061 ).

I did some crude math on these numbers above with (cat timestamp_mismatch | xargs -I{} -- bash -c "echo '{}' | awk '{ print \$12 \$16 }'" | tr "," "-") | bc -l and found that we have mismatches between 303 and 357 seconds, 353 on average. Going forward I will bump this timeout higher for now as a workaround until we understood and fix the underlying performance issue.

Actions #46

Updated by nicksinger 4 months ago

  • Description updated (diff)
Actions #47

Updated by nicksinger 4 months ago

  • Status changed from In Progress to Feedback

PR to configure the delta between two timestamps: https://github.com/os-autoinst/openQA/pull/5787

Actions #49

Updated by nicksinger 4 months ago

  • Status changed from Feedback to Resolved

I validated that the openQA changes are deployed and applied my config change manually (including restarting services) for now until our pipelines work again. Until now we don't see the new error message which is expected and good. We discussed that this should be sufficient for now and other alerts (e.g. number of new incomplete jobs) should alert us if the situation gets worse.

Actions #50

Updated by okurz 4 months ago

  • Copied to action #164418: Distinguish "timestamp mismatch" from cases where webUI is slow or where clocks are really differing added
Actions #51

Updated by nicksinger 4 months ago

  • Has duplicate deleted (action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S)
Actions #52

Updated by nicksinger 4 months ago

  • Related to action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error message size:S added
Actions #53

Updated by okurz 4 months ago

  • Due date deleted (2024-07-30)
Actions

Also available in: Atom PDF