openSUSE Project Management Tool
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4979052022-03-04T11:05:49Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>To me it looks like it was caused by data gaps again:<br>
<a href="https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&tab=alert&viewPanel=84&from=1646285146518&to=1646301111736" class="external">https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&tab=alert&viewPanel=84&from=1646285146518&to=1646301111736</a></p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4980672022-03-04T15:11:55Zmkittlermarius.kittler@suse.com
<ul><li><strong>File</strong> <a href="/attachments/12928">screenshot_20220304_160005.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/12928/screenshot_20220304_160005.png">screenshot_20220304_160005.png</a> added</li><li><strong>File</strong> <a href="/attachments/12931">screenshot_20220304_160755.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/12931/screenshot_20220304_160755.png">screenshot_20220304_160755.png</a> added</li><li><strong>File</strong> <a href="/attachments/12934">screenshot_20220304_160913.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/12934/screenshot_20220304_160913.png">screenshot_20220304_160913.png</a> added</li></ul><p>I've also just had a look. The InfluxDB query is very slow when selecting a time range like "Last 2 days". Maybe we're collecting too many data points per time interval. Regardless, it does indeed look like gaps are causing this:</p>
<p><img src="https://progress.opensuse.org/attachments/download/12928/screenshot_20220304_160005.png" alt="" loading="lazy" /></p>
<p>Some other graphs have gaps as well but not all:</p>
<p><img src="https://progress.opensuse.org/attachments/download/12931/screenshot_20220304_160755.png" alt="" loading="lazy" /></p>
<p>The CPU load was quite high from time to time but the HTTP response graph shows no gaps:<br>
<img src="https://progress.opensuse.org/attachments/download/12934/screenshot_20220304_160913.png" alt="" loading="lazy" /></p>
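<p>For illustration, gaps like the ones in the screenshots above could be located programmatically. The following is only a minimal sketch: the function name <code>find_gaps</code>, the 30-second tolerance and the sample timestamps are assumptions made for this example and are not taken from the dashboard or from InfluxDB.</p>

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_interval):
    """Return (start, end) pairs of consecutive samples further apart than max_interval."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


# Hypothetical sample times: points at 0, 10 and 20 seconds, then nothing until 90 seconds.
start = datetime(2022, 3, 4, 0, 0, 0)
samples = [start + timedelta(seconds=s) for s in (0, 10, 20, 90, 100)]

# Anything longer than 30 seconds between two points is treated as a gap (assumed tolerance).
gaps = find_gaps(samples, timedelta(seconds=30))
for gap_start, gap_end in gaps:
    print(f"gap from {gap_start.isoformat()} to {gap_end.isoformat()}")
```

<p>In practice the timestamps would come from the series behind the affected panel, and the tolerance would be chosen relative to the collection interval.</p>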
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4989702022-03-09T10:08:08Zokurzokurz@suse.com
<ul><li><strong>Priority</strong> changed from <i>High</i> to <i>Urgent</i></li></ul> openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4990062022-03-09T10:12:45Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-6 priority-high2 closed" href="/issues/107257">action #107257</a>: [alert][osd] Apache Response Time alert size:M</i> added</li></ul> openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4990122022-03-09T10:13:09Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-5 priority-high3 closed behind-schedule" href="/issues/96807">action #96807</a>: Web UI is slow and Apache Response Time alert got triggered</i> added</li></ul> openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4990302022-03-09T10:27:09Zlivdywanliv.dywan@suse.com
<ul><li><strong>Subject</strong> changed from <i>[alert][osd] Apache Response Time alert</i> to <i>[alert][osd] Apache Response Time alert size:M</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/499030/diff?detail_id=471868">diff</a>)</li><li><strong>Status</strong> changed from <i>New</i> to <i>Workable</i></li><li><strong>Assignee</strong> set to <i>tinita</i></li></ul> openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4990602022-03-09T10:52:30Zokurzokurz@suse.com
<ul><li><strong>Status</strong> changed from <i>Workable</i> to <i>In Progress</i></li></ul><p><a class="user active user-mention" href="https://progress.opensuse.org/users/33482">@tinita</a> I have an idea regarding the apache response alert ticket after looking at the graph. I prepared an MR for the dashboard</p>
<p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662</a></p>
<p>You could look into how telegraf parses the apache logs.</p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4990932022-03-09T12:17:44Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>All graphs with gaps are reading from the apache_log table, but the comment <code>Response time measured by the apache proxy [...]</code> suggests that this data comes from the proxy logs and not from apache itself.</p>
<p>I need to find out where to find the proxy and the logs.</p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4992172022-03-09T21:00:04Zokurzokurz@suse.com
<ul></ul><p>tinita wrote:</p>
<blockquote>
<p>All graphs with gaps are reading from the apache_log table, but the comment <code>Response time measured by the apache proxy [...]</code> suggests that this data comes from the proxy logs and not from apache itself.</p>
<p>I need to find out where to find the proxy and the logs.</p>
</blockquote>
<p>We use apache as the reverse proxy for openQA, so apache == proxy.</p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4992272022-03-09T22:04:02Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>Created <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664</a> to replace <code>logparser</code> with <code>tail</code></p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=4992722022-03-10T04:11:58Zopenqa_reviewopenqa-review@suse.de
<ul><li><strong>Due date</strong> set to <i>2022-03-24</i></li></ul><p>Setting due date based on mean cycle time of SUSE QE Tools</p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=5001112022-03-11T10:56:30Zokurzokurz@suse.com
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-6 status-15 priority-4 priority-default child parent" href="/issues/108209">coordination #108209</a>: [epic] Reduce load on OSD</i> added</li></ul> openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=5001202022-03-11T11:04:21Zokurzokurz@suse.com
<ul></ul><p>In the weekly we extracted <a class="issue tracker-6 status-15 priority-4 priority-default child parent" title="coordination: [epic] Reduce load on OSD (Blocked)" href="https://progress.opensuse.org/issues/108209">#108209</a> into a separate ticket, so all mid- and long-term ideas should go there. Here we should focus on short-term mitigations that avoid alerts while our system is still operable (under the known constraints).</p>
<p><a class="user active user-mention" href="https://progress.opensuse.org/users/33482">@tinita</a> try out different log parsing intervals in the telegraf config for apache logs and monitor if the alert still triggers. Maybe <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/662</a> and <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/664</a> are already enough.</p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=5001982022-03-11T12:29:00Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>Increase the interval for tail: <a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/665" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/665</a></p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=5006602022-03-14T17:00:17Ztinitatina.mueller+trick-redmine@suse.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul><p><a href="https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/665" class="external">https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/665</a> was merged.</p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=5009302022-03-15T12:41:10Ztinitatina.mueller+trick-redmine@suse.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul><p>Even after the interval change to 30s was merged, we still have gaps (there was a one-hour gap this morning, in the middle of a 3-hour timeframe with high load).</p>
<p>But we haven't seen alerts, so I consider this ticket resolved, as we have a followup ticket about the high load.</p>
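<p>One way to see why gaps do not have to trigger an alert is to look at how a threshold rule might treat a window without data. The sketch below is purely hypothetical: <code>evaluate_alert</code>, its parameters and the state names are invented for this example and do not describe the actual Grafana rule or what any of the merge requests changed.</p>

```python
from statistics import mean


def evaluate_alert(window_values, threshold, previous_state="ok"):
    """Return 'alerting' or 'ok' for one evaluation window.

    An empty window (a data gap) keeps the previous state instead of
    switching to 'alerting'. This policy is an assumption for illustration.
    """
    if not window_values:
        return previous_state
    return "alerting" if mean(window_values) > threshold else "ok"


# Hypothetical response times in seconds for three consecutive windows.
state = "ok"
for window in ([0.2, 0.4], [], [2.5, 3.0]):
    state = evaluate_alert(window, threshold=1.0, previous_state=state)
    print(window, "->", state)
```

<p>With such a policy a gap on its own never changes the alert state; only windows that actually contain slow responses do.</p>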
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=5009362022-03-15T12:42:16Ztinitatina.mueller+trick-redmine@suse.com
<ul></ul><p>By the way, just out of curiosity I created a Grafana dashboard: <a href="https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard" class="external">https://monitor.qa.suse.de/d/1pHb56Lnk/tinas-dashboard</a>. It can be useful for seeing which types of requests we get and from which user agents.</p>
openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=5020912022-03-18T10:30:05Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-3 priority-lowest closed" href="/issues/94111">action #94111</a>: Optimize /api/v1/jobs</i> added</li></ul> openQA Infrastructure - action #107875: [alert][osd] Apache Response Time alert size:Mhttps://progress.opensuse.org/issues/107875?journal_id=6342072023-05-17T11:10:59Zokurzokurz@suse.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-4 status-3 priority-6 priority-high2 closed child" href="/issues/128789">action #128789</a>: [alert] Apache Response Time alert size:M</i> added</li></ul>