Project

General

Profile

action #162038

Updated by okurz 4 months ago

## Observation 
 There were many logs for `check whether clocks on the local host and the web UI host are in sync`. 
 Openqa seems unavailable for ~10min 

 ``` 
 Jun 10 00:43:01 openqa systemd[1]: Started Session c46901 of User geekotest.                                                                                                                                                                  
 Jun 10 00:43:01 openqa cron[13285]: pam_unix(crond:session): session opened for user geekotest by (uid=0)                                                                                                                                     
 Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3522 from worker_status (free)                                                                                                            
 Jun 10 00:43:01 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 3464 from worker_status (free)                                                                                                            
 Jun 10 00:43:02 openqa openqa[32518]: [debug] Rejecting authentication for                                                                                                                                                                    
 user "openqaworker4" with ip "10.100.96.76", valid key "...",                                                                                                                                                                    
 secret "...", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync                                                                                                                                                                                                    
 Jun 10 00:43:02 openqa openqa[4972]: [debug] Rejecting authentication for user "openqaworker4" with ip "**", valid key "**", secret "**", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync 
 Jun 10 00:43:02 openqa openqa-websockets-daemon[23898]: [debug] [pid:23898] Updating seen of worker 1952 from worker_status (working)                                                                                                         
 Jun 10 00:43:02 openqa openqa[745]: [debug] Rejecting authentication for user 
 ... 
 Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] Enqueued restarting openQA job 14563096 via Minion job 11664862                                                                                                                       
 Jun 10 00:43:04 openqa openqa[1637]: [debug] [pid:1637] Enqueued restarting openQA job 14561971 via Minion job 11664863                                                                                                                       
 Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557] _carry_over_candidate(14563096): _failure_reason=cleanup_before_shutdown:none,force_scheduled_tasks:none,grub_set_bootargs:none,hostname:none,shutdown:none                           
 Jun 10 00:43:04 openqa openqa[745]: [debug] [pid:745] Enqueued restarting openQA job 14562006 via Minion job 11664864                                                                                                                         
 Jun 10 00:43:04 openqa openqa[32535]: [debug] [pid:32535] _carry_over_candidate(14560798): _failure_reason=consoletest_finish:none,mariadb_srv:none,mtab:none,openvswitch:none,orphaned_packages_check:none,rsync:none,ssh_cleanup:none,sshd$ 
 Jun 10 00:43:04 openqa openqa[6557]: [debug] [pid:6557]                                                                                                                                                                                       
 _carry_over_candidate(14563096): checking take over from 14558523:                                                                                                                                                                            
 _failure_reason=GOOD 
 ``` 
 chrony status looks ok on OSD. 

 I saw also a python3 update and then issues with telegraf but I dont think they are related with OSD. 

 ## Steps to reproduce 

 Find jobs referencing this ticket with the help of 
 https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label , 
 call `openqa-query-for-job-label poo#162038` 

 ## Suggestions 
 * Consider outputting the actual "mismatch" value (the time delta exceeding some threshold) in the log message if possible and not hidden in a hashsum (https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Shared/Controller/Auth.pm#L162) 
 * Find out/fix why some jobs fail without a specific error besides "api failure" besides showing timestamp errors in logs 
 * Crosscheck with #163592 if the symptoms actually still appear 
 * Try to simulate slow response times (by putting a sleep at the right place) to reproduce the issue (How slow can we go before "timestamp mismatch" becomes an issue?) 
 * If the real cause for the error message is actually not a time difference but too slow responses by the web UI extend the error message accordingly 

 ## Out of scope 
 * Add details to bare "api failure" #163781 
 * Teaching openqa-label-known-issues to look into reason when autoinst-log.txt #164296

Back