Project

General

Profile

action #155326

Updated by mkittler 8 months ago

## Observation 
 https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=1707692628142&to=1707697161692&editPanel=78 

 There was no restart of the OSD VM in that timeframe. 

 There are some gaps in the logs (maybe not related) and also some DBIx error: 
 ``` 
 Feb 12 00:29:33 openqa openqa[10051]: [debug] [pid:10051] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub 
 Feb 12 00:29:34 openqa openqa[10051]: [debug] [pid:10051] suse.openqa.job.done published 
 Feb 12 00:35:11 openqa openqa[6188]: [debug] Rejecting authentication for user "openqaworker4" with ip "…", valid key "…", secret "…", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync 
 Feb 12 00:35:11 openqa openqa[9325]: [debug] Rejecting authentication for user "openqaworker4" with ip "…", valid key "…", secret "…", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync 
 … 
 Feb 12 00:35:11 openqa openqa[8593]: [debug] redirect to /assets/hdd/SLE-15-SP3-x86_64-Build20240211-1-qam-sles4sap-gnome.qcow2 
 Feb 12 00:35:12 openqa openqa[12728]: [debug] redirect to /assets/hdd/SLE-15-SP3-x86_64-Build20240211-1-qam-sles4sap-gnome.qcow2 
 Feb 12 00:38:11 openqa openqa[8518]: [debug] redirect to /assets/iso/SLE-15-SP3-Online-x86_64-GMC-Media1.iso 
 Feb 12 00:38:11 openqa openqa[4016]: [debug] redirect to /assets/iso/SLE-15-SP3-Online-x86_64-GMC-Media1.iso 
 Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] _carry_over_candidate(13499037): _failure_reason=GOOD 
 Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] Sending AMQP event: suse.openqa.job.done 
 Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub 
 Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] suse.openqa.job.done published 
 Feb 12 00:38:12 openqa openqa[11635]: [error] [ztQJ1_pAsMiS] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x55b77ea45e28): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 139 
 Feb 12 00:38:15 openqa openqa[4016]: [debug] [pid:4016] _carry_over_candidate(13498958): _failure_reason=GOOD 
 Feb 12 00:38:15 openqa openqa[4016]: [debug] [pid:4016] Sending AMQP event: suse.openqa.job.done 
 … 
 Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] _carry_over_candidate(13498608): found a good candidate (13487198) 
 Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] Sending AMQP event: suse.openqa.job.done 
 Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub 
 Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] suse.openqa.job.done published 
 Feb 12 00:41:15 openqa openqa[4016]: [debug] [pid:4016] suse.openqa.job.done published 
 Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] _carry_over_candidate(13498858): _failure_reason=trustedprograms:softfailed 
 Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] _carry_over_candidate(13498858): checking take over from 13487446: _failure_reason=trustedprograms:softfailed 
 Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] _carry_over_candidate(13498858): found a good candidate (13487446) 
 Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] Sending AMQP event: suse.openqa.job.done 
 … 
 Feb 12 00:41:44 openqa openqa[5819]: [debug] [pid:5819] Sending AMQP event: suse.openqa.job.done 
 Feb 12 00:41:44 openqa openqa[5819]: [debug] [pid:5819] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub 
 Feb 12 00:41:44 openqa openqa[5819]: [debug] [pid:5819] suse.openqa.job.done published 
 Feb 12 00:44:14 openqa openqa[15227]: [info] Worker 4100 stopped 
 Feb 12 00:44:14 openqa openqa[31597]: [info] Worker 31597 started 
 ``` 

 There was also a high number of Minion jobs in that timeframe: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1707681536056&to=1707702390386 

 ## Suggestions 
 * Look at other monitoring panels from the same time, e.g. returned database rows shows a suspicious spike 
 * Lookup older tickets about "too many workers uploading" or similar 
 * Closely check all log files including database server logs, specifically for the big 25M row request seen in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1707693500384&to=1707693665688&viewPanel=89 
 * Specifically look into the only error in the log excerpt "[ztQJ1_pAsMiS] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x55b77ea45e28): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 139" 
 * Find an acceptable hypothesis for what happened causing the unresponsiveness 
 * Plan according improvements

Back