action #154021
closed - [alert] Ratio of not restarted multi-machine tests by result
Description
Observation
https://stats.openqa-monitor.qa.suse.de/alerting/grafana/0XohcmfVk/view?orgId=1
Date: Sun, 21 Jan 2024 21:31:37 +0100
2 firing alert instances
Firing [stats.openqa-monitor.qa.suse.de]: Ratio of not restarted multi-machine tests by result alert
Values: A0=50
Labels:
- alertname: Ratio of not restarted multi-machine tests by result alert
- grafana_folder: Salt
- rule_uid: 0XohcmfVk
Annotations:
- message:
Investigation hints:
- Investigate what caused the ratio to change that significantly
- Check https://openqa.suse.de/tests?resultfilter=Failed and look for a correlation
- Follow https://progress.opensuse.org/projects/openqatests/wiki/Wiki#Statistical-investigation
- Check if the number of failed jobs stays high for longer or if this was just caused by a single scenario failing as a whole. See https://progress.opensuse.org/issues/96191 for details
Updated by mkittler 10 months ago
- Status changed from New to In Progress
- Assignee set to mkittler
Looks like this is good again, but we have a gap in the data and after that gap a value of 100 % fail rate (50 % failed + 50 % parallel failed), which looks very weird: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=24&orgId=1&from=1705754659644&to=1706010204892
Updated by mkittler 10 months ago · Edited
When looking into failed jobs with parallel dependencies since Saturday, it doesn't look too bad in absolute figures:
openqa=> select distinct count(jobs.id), array_agg(jobs.id),
           (select name from job_groups where id = group_id),
           (array_agg(test))[1] as example_test
         from jobs
         left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null
         where dependency = 2 and t_finished >= '2024-01-20'
           and result in ('failed') and test not like '%:investigate:%'
         group by group_id
         order by count(jobs.id) desc;
count | array_agg | name | example_test
-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------------------------------------------------
19 | {13314890,13302354,13314880,13314892,13314888,13302360,13302352,13302350,13302310,13314874,13314878,13314882,13302337,13302339,13302362,13302356,13314870,13302341,13314872} | YaST Maintenance Updates - Development | mru-iscsi_client_normal_auth_backstore_fileio_dev
5 | {13310295,13310164,13310164,13310295,13308014} | Maintenance: SLE 15 SP5 HA Incidents | qam_ha_rolling_upgrade_migration_supportserver
3 | {13302400,13306361,13302398} | YaST Maintenance Updates | mru-iscsi_server_normal_auth_backstore_hdd
3 | {13311675,13311784,13311713} | Maintenance - QR - SLE15SP5-SAP | ha_qdevice_node2
2 | {13294837,13297632} | Maintenance: SLE 15 SP4 HA Incidents | qam_ha_rolling_update_node02
2 | {13299897,13312377} | JeOS: Development | jeos-nfs-client
2 | {13314918,13309818} | Test Security | cc_ipsec_client
1 | {13311651} | Maintenance - QR - SLE15SP5-Security | fips_env_stunnel_client
1 | {13315685} | Maintenance: SLE 15 SP2 HA Incidents | qam_ha_rolling_update_node02
1 | {13316822} | HA Development | ha_hawk_haproxy_node02_test
1 | {13316842}
When also considering parallel_failed, the numbers double or triple depending on the number of jobs in the scenario, but I guess it is more useful to think in terms of scenarios (and thus to only query for failed).
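For comparison, a variant of the query above that also counts parallel_failed jobs would only change the result filter (a sketch; everything else stays as in the query above):

select distinct count(jobs.id), array_agg(jobs.id),
    (select name from job_groups where id = group_id),
    (array_agg(test))[1] as example_test
  from jobs
  left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null
  where dependency = 2 and t_finished >= '2024-01-20'
    and result in ('failed', 'parallel_failed')  -- also count parallel_failed
    and test not like '%:investigate:%'
  group by group_id
  order by count(jobs.id) desc;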
But some of the failures do in fact look like the network within the SUTs was broken, e.g. https://openqa.suse.de/tests/13314878#step/patch_and_reboot/196. Considering we currently still disrupt the GRE network when booting or shutting down a worker, it is unfortunately expected that we see a small number of jobs like this. (Of course there are also failures which likely have a non-network-related cause, e.g. https://openqa.suse.de/tests/13310295#step/setup/54 and https://openqa.suse.de/tests/13294837#step/suseconnect_scc/37.)
Updated by mkittler 10 months ago · Edited
- Status changed from In Progress to Feedback
Considering the absolute figures are not very high and the graph also looks good again, I would not make a big deal of it. It looks like there's no silence to remove.
I've got a response about the scenario: https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=mru-iscsi_client_normal_auth_backstore_fileio_dev&version=15-SP1
This has been running fine forever in the official job group:
https://openqa.suse.de/tests/13310783#next_previous
That should be some misconfigured remnant, as the patching happens in the parent image. It would be a good idea if you had a way to exclude the junk we have in dev groups :slightly_smiling_face: but I cannot guarantee that we don't have some forgotten test suite without a progress ticket there. We plan to clean that up for sure. I will create some tickets.
So we should probably just ignore jobs in development groups. MR for that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1098
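The MR itself is not quoted here; as an illustration only, "ignore jobs in development groups" could be expressed in a query like the one above by filtering on the job group name (the '%Development%' pattern is my assumption; the actual MR may use a different criterion, such as the parent job group):

select count(jobs.id), (select name from job_groups where id = group_id) as group_name
  from jobs
  left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null
  where dependency = 2 and t_finished >= '2024-01-20'
    and result in ('failed')
    -- assumption: development groups are recognizable by their name
    and (select name from job_groups where id = group_id) not like '%Development%'
  group by group_id;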
Not sure about the scenario https://openqa.suse.de/tests/13310295#step/setup/54. Considering other jobs in the Next & Previous tab look good, we probably don't have to look into it. That was just a test job.
Updated by okurz 10 months ago · Edited
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1098 merged, deploying. Waiting for deployment and update of data in monitoring.
EDIT: We talked about the encountered failed jobs and we agreed that https://openqa.suse.de/tests/13310295#step/setup/54 is a red herring, a test job triggered by emiura
Updated by okurz 10 months ago
- Status changed from Feedback to Resolved
https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=24&orgId=1 looks good, "failed+parallel_failed" at 3 % + 5 %, good enough
Updated by okurz 10 months ago
- Status changed from Resolved to In Progress
We forgot to look into the actual problem and have been hit by it again at exactly the same point in time, which is Sunday 03:00 CET in the morning when machines reboot:
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1706353426537&to=1706450852479&viewPanel=24
Updated by openqa_review 10 months ago
- Due date set to 2024-02-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 10 months ago
As discussed in the daily, the problem is that we only consider a timeframe of 24h, so it can happen that there are, for example, just 2 jobs scheduled, leading to bad statistics. Instead we could change to a fixed number of jobs, e.g. in monitoring/telegraf/telegraf-webui.conf go from
sqlquery="with mm_jobs as (select …) where t_created >= (select timezone('UTC', now()) - interval '24 hour') … as ratio_mm from mm_jobs group by mm_jobs.result"
sqlquery="with mm_jobs as (select …) where … as ratio_mm from mm_jobs group by mm_jobs.result ordery by id limit 1000"
Updated by okurz 10 months ago
- Related to action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added