Actions
action #155326
closed[alert] "HTTP Response" alert fired shortly on 2024-02-12 and 2024-03-04 size:M
Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-02-12
Due date:
% Done:
0%
Estimated time:
Tags:
Description
Observation¶
There was no restart of the OSD VM in that timeframe.
There are some gaps in the logs (maybe not related) and also some DBIx error:
Feb 12 00:29:33 openqa openqa[10051]: [debug] [pid:10051] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub
Feb 12 00:29:34 openqa openqa[10051]: [debug] [pid:10051] suse.openqa.job.done published
Feb 12 00:35:11 openqa openqa[6188]: [debug] Rejecting authentication for user "openqaworker4" with ip "…", valid key "…", secret "…", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
Feb 12 00:35:11 openqa openqa[9325]: [debug] Rejecting authentication for user "openqaworker4" with ip "…", valid key "…", secret "…", timestamp mismatch - check whether clocks on the local host and the web UI host are in sync
…
Feb 12 00:35:11 openqa openqa[8593]: [debug] redirect to /assets/hdd/SLE-15-SP3-x86_64-Build20240211-1-qam-sles4sap-gnome.qcow2
Feb 12 00:35:12 openqa openqa[12728]: [debug] redirect to /assets/hdd/SLE-15-SP3-x86_64-Build20240211-1-qam-sles4sap-gnome.qcow2
Feb 12 00:38:11 openqa openqa[8518]: [debug] redirect to /assets/iso/SLE-15-SP3-Online-x86_64-GMC-Media1.iso
Feb 12 00:38:11 openqa openqa[4016]: [debug] redirect to /assets/iso/SLE-15-SP3-Online-x86_64-GMC-Media1.iso
Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] _carry_over_candidate(13499037): _failure_reason=GOOD
Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] Sending AMQP event: suse.openqa.job.done
Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub
Feb 12 00:38:12 openqa openqa[9325]: [debug] [pid:9325] suse.openqa.job.done published
Feb 12 00:38:12 openqa openqa[11635]: [error] [ztQJ1_pAsMiS] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x55b77ea45e28): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 139
Feb 12 00:38:15 openqa openqa[4016]: [debug] [pid:4016] _carry_over_candidate(13498958): _failure_reason=GOOD
Feb 12 00:38:15 openqa openqa[4016]: [debug] [pid:4016] Sending AMQP event: suse.openqa.job.done
…
Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] _carry_over_candidate(13498608): found a good candidate (13487198)
Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] Sending AMQP event: suse.openqa.job.done
Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub
Feb 12 00:38:22 openqa openqa[10955]: [debug] [pid:10955] suse.openqa.job.done published
Feb 12 00:41:15 openqa openqa[4016]: [debug] [pid:4016] suse.openqa.job.done published
Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] _carry_over_candidate(13498858): _failure_reason=trustedprograms:softfailed
Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] _carry_over_candidate(13498858): checking take over from 13487446: _failure_reason=trustedprograms:softfailed
Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] _carry_over_candidate(13498858): found a good candidate (13487446)
Feb 12 00:41:15 openqa openqa[13173]: [debug] [pid:13173] Sending AMQP event: suse.openqa.job.done
…
Feb 12 00:41:44 openqa openqa[5819]: [debug] [pid:5819] Sending AMQP event: suse.openqa.job.done
Feb 12 00:41:44 openqa openqa[5819]: [debug] [pid:5819] AMQP URL: amqps://openqa:…@rabbit.suse.de:5671/?exchange=pubsub
Feb 12 00:41:44 openqa openqa[5819]: [debug] [pid:5819] suse.openqa.job.done published
Feb 12 00:44:14 openqa openqa[15227]: [info] Worker 4100 stopped
Feb 12 00:44:14 openqa openqa[31597]: [info] Worker 31597 started
There was also a high number of Minion jobs in that timeframe: https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1707681536056&to=1707702390386
Suggestions¶
- Look at other monitoring panels from the same time, e.g. returned database rows shows a suspicious spike
- Lookup older tickets about "too many workers uploading" or similar
- Closely check all log files including database server logs, specifically for the big 25M row request seen in https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1707693500384&to=1707693665688&viewPanel=89
- Specifically look into the only error in the log excerpt "[ztQJ1_pAsMiS] DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x55b77ea45e28): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 139"
- Find an acceptable hypothesis for what happened causing the unresponsiveness
- Plan according improvements
Updated by okurz 8 months ago
- Copied to action #156754: "DBIx::Class::Row::update(): Can't update OpenQA::Schema::Result::JobLocks=HASH(0x55b77ea45e28): row not found at /usr/share/openqa/script/../lib/OpenQA/Resource/Locks.pm line 139" size:S added
Actions