Project

General

Profile

Actions

action #162038

open

No HTTP Response on OSD on 10-06-2024 - auto_review:".*timestamp mismatch - check whether clocks on the local host and the web UI host are in sync":retry size:S

Added by okurz about 1 month ago. Updated about 3 hours ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-06-10
Due date:
2024-07-30 (Due in 13 days)
% Done:

0%

Estimated time:

Description

Observation

There were many logs for check whether clocks on the local host and the web UI host are in sync.
Openqa seems unavailable for ~10min

Jun 10 00:43:01 openqa systemd[1]: Started Session c46901 of User geekotest.                                                                                                                                                                 
Jun 10 00:43:01 openqa cron[13285]: pam_unix(crond:session): session opened for user geekotest by (uid=0)                                                                                                                                    
Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3522 from worker_status (free)                                                                                                           
Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3464 from worker_status (free)                                                                                                           
Jun 10 00:43:02 openqa openqa[32518]: [debug] Rejecting authentication for                                                                                                                                                                   
user "openqaworker4" with ip "10.100.96.76", valid key "...",                                                                                                                                                                   
secret "...", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync                                                                                                                                                                                                   
Jun 10 00:43:02 openqa openqa[4972]: [debug] Rejecting authentication for user "openqaworker4" with ip "**", valid key "**", secret "**", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
Jun 10 00:43:02 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 1952 from worker_status (working)                                                                                                        
Jun 10 00:43:02 openqa openqa[745]: [debug] Rejecting authentication for user
...
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] Enqueued restarting openQA job 14563096 via Minion job 11664862                                                                                                                      
Jun 10 00:43:04 openqa openqa[1637]: [debug] [pid:1637] Enqueued restarting openQA job 14561971 via Minion job 11664863                                                                                                                      
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] _carry_over_candidate(14563096): _failure_reason=cleanup_before_shutdown:none,force_scheduled_tasks:none,grub_set_bootargs:none,hostname:none,shutdown:none                          
Jun 10 00:43:04 openqa openqa[745]: [debug] [pid:745] Enqueued restarting openQA job 14562006 via Minion job 11664864                                                                                                                        
Jun 10 00:43:04 openqa openqa[32535]: [debug] [pid:32535] _carry_over_candidate(14560798): _failure_reason=consoletest_finish:none,mariadb_srv:none,mtab:none,openvswitch:none,orphaned_packages_check:none,rsync:none,ssh_cleanup:none,sshd$
Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557]                                                                                                                                                                                      
_carry_over_candidate(14563096): checking take over from 14558523:                                                                                                                                                                           
_failure_reason=GOOD

chrony status looks ok on OSD.

I saw also a python3 update and then issues with telegraf but I dont think they are related with OSD.

Steps to reproduce

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
call openqa-query-for-job-label poo#162038

Suggestions

  • Consider outputting the actual "mismatch" value (the time delta exceeding some threshold) in the log message if possible and not hidden in a hashsum (https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Shared/Controller/Auth.pm#L162)
  • Find out/fix why some jobs fail without a specific error besides "api failure" besides showing timestamp errors in logs
  • Crosscheck with #163592 if the symptoms actually still appear
  • Try to simulate slow response times (by putting a sleep at the right place) to reproduce the issue (How slow can we go before "timestamp mismatch" becomes an issue?)
  • If the real cause for the error message is actually not a time difference but too slow responses by the web UI extend the error message accordingly

Out of scope

  • Add details to bare "api failure" #163781

Related issues 4 (2 open2 closed)

Related to openQA Tests - action #162644: Job fails with api failure: 403 response: timestamp mismatch after brief osd outageRejectedokurz2024-06-20

Actions
Related to openQA Infrastructure - action #159396: Repeated HTTP Response alert for /tests and unresponsiveness due to potential detrimental impact of pg_dump (was: HTTP Response alert for /tests briefly going up to 15.7s) size:MResolveddheidler2024-07-16

Actions
Related to openQA Infrastructure - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:MIn Progressokurz2024-07-102024-07-27

Actions
Has duplicate openQA Project - action #163781: Jobs randomly fail with unspecified "api failure", there should be more details in the error messageNew2024-07-11

Actions
Actions

Also available in: Atom PDF