action #129065
Updated by mkittler over 1 year ago
## Observation

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683722604024&to=1683725326412&viewPanel=78 alerted on 2023-05-10 15:07 CEST

## Acceptance criteria

* **AC1**: The alert is not firing anymore.
* **AC2**: Logs have been investigated.

## Suggestions

* Look into the timeframe https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=1683723624920&to=1683724305517 and compare it to other panels on OSD to see whether it is visible what made the system busy. DONE: nothing too unusual; maybe slightly elevated I/O times, but far from concerning.
* @okurz suggested in https://suse.slack.com/archives/C02AJ1E568M/p1683724668733689?thread_ts=1683724103.321589&cid=C02AJ1E568M that it might be caused by something we don't collect metrics from - brainstorm what that could be and implement metrics for it (see the sketch appended at the end of this comment).
* Open network connections: nsinger observed peaks of >2k, ~75% of them related to httpd-prefork, ~20% to openqa-websocket.
* > (Nick Singer) I'm currently logged into OSD. CPU utilization is quite high with a longterm load of 12 and shortterm of ~14 with only 12 cores on OSD. velociraptor goes up to 200% and is in general quite high in the process list, but also telegraf and obviously openQA itself.
  >
  > (Oliver Kurz) all of that sounds fine. When the HTTP response was high I just took a look and the CPU usage was near 0, same as we suspected in the past. Remember our debugging on why qanet is slow? Comparable to that, but here it's likely apache, the number of concurrent connections, something like that.
* Take https://suse.slack.com/archives/C02CANHLANP/p1683723956965209 into account - is there something we can do to improve this situation?
  > (Joaquin Rivera) is OSD also slow for someone else? (edited)
  >
  > (Fabian Vogt) That might be partially because of the yast2_nfs_server jobs for investigation. You might want to delete them now that they did their job. (e.g. https://openqa.suse.de/tests/11085729. Don't open, might crash your browser...). Those jobs are special: serial_terminal has some race condition, so they hammer enter_cmd + assert_script_run in a loop until it fails.

## Out of scope

* Limiting the number of test result step uploads or handling the effect of test result step uploading -> #129068
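
As a starting point for the "open network connections" observation above, here is a minimal sketch (assuming `ss` from iproute2 and GNU grep with PCRE support are available on OSD, and root privileges so `ss -p` can show owning processes) of how established TCP connections per process could be counted, which is roughly the kind of breakdown nsinger reported (httpd-prefork vs. openqa-websocket):

```sh
# Count established TCP connections grouped by the owning process name.
# Needs root (or CAP_NET_ADMIN) so ss can populate the "users:(...)" field.
ss -Htnp state established \
  | grep -oP 'users:\(\("\K[^"]+' \
  | sort | uniq -c | sort -rn
```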
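If that turns out to be a useful signal, it could be collected continuously via telegraf's `[[inputs.exec]]` plugin and graphed next to the existing OSD panels. The drop-in file name, the helper script path and the measurement name below are made up for illustration, not an existing part of our salt/telegraf setup:

```toml
# Hypothetical drop-in, e.g. /etc/telegraf/telegraf.d/open_connections.conf
[[inputs.exec]]
  # count_connections.sh would wrap the ss pipeline above and print
  # InfluxDB line protocol, e.g.:
  #   open_connections,process=httpd-prefork count=1500i
  commands = ["/usr/local/bin/count_connections.sh"]
  data_format = "influx"
  timeout = "15s"
  interval = "1m"
```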