action #154021

closed

[alert] Ratio of not restarted multi-machine tests by result

Added by tinita 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-01-22
Due date: 2024-02-12
% Done: 0%
Estimated time:
Description

Observation

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/0XohcmfVk/view?orgId=1

Date: Sun, 21 Jan 2024 21:31:37 +0100
2 firing alert instances
Firing [stats.openqa-monitor.qa.suse.de]
Ratio of not restarted multi-machine tests by result alert

Values: A0=50
Labels:
  alertname: Ratio of not restarted multi-machine tests by result alert
  grafana_folder: Salt
  rule_uid: 0XohcmfVk
Annotations:
  message: Investigation hints:

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M (Resolved, jbaier_cz, 2024-01-30)

Actions #1

Updated by tinita 3 months ago

  • Description updated (diff)
Actions #2

Updated by tinita 3 months ago

  • Description updated (diff)
Actions #3

Updated by mkittler 3 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Looks like this is good again, but we have a gap in the data and, after that gap, a value of 100 % fail rate (50 % failed + 50 % parallel failed), which looks very weird: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=24&orgId=1&from=1705754659644&to=1706010204892

Actions #4

Updated by mkittler 3 months ago · Edited

When looking into failed jobs with parallel dependencies since Saturday it doesn't look too bad in absolute figures:

openqa=> select distinct count(jobs.id), array_agg(jobs.id), (select name from job_groups where id = group_id), (array_agg(test))[1] as example_test from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null where dependency = 2 and t_finished >= '2024-01-20' and result in ('failed') and test not like '%:investigate:%' group by group_id order by count(jobs.id) desc;
 count |                                                                                  array_agg                                                                                   |                  name                  |                   example_test                    
-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------------------------------------------------
    19 | {13314890,13302354,13314880,13314892,13314888,13302360,13302352,13302350,13302310,13314874,13314878,13314882,13302337,13302339,13302362,13302356,13314870,13302341,13314872} | YaST Maintenance Updates - Development | mru-iscsi_client_normal_auth_backstore_fileio_dev
     5 | {13310295,13310164,13310164,13310295,13308014}                                                                                                                               | Maintenance: SLE 15 SP5 HA Incidents   | qam_ha_rolling_upgrade_migration_supportserver
     3 | {13302400,13306361,13302398}                                                                                                                                                 | YaST Maintenance Updates               | mru-iscsi_server_normal_auth_backstore_hdd
     3 | {13311675,13311784,13311713}                                                                                                                                                 | Maintenance - QR - SLE15SP5-SAP        | ha_qdevice_node2
     2 | {13294837,13297632}                                                                                                                                                          | Maintenance: SLE 15 SP4 HA Incidents   | qam_ha_rolling_update_node02
     2 | {13299897,13312377}                                                                                                                                                          | JeOS: Development                      | jeos-nfs-client
     2 | {13314918,13309818}                                                                                                                                                          | Test Security                          | cc_ipsec_client
     1 | {13311651}                                                                                                                                                                   | Maintenance - QR - SLE15SP5-Security   | fips_env_stunnel_client
     1 | {13315685}                                                                                                                                                                   | Maintenance: SLE 15 SP2 HA Incidents   | qam_ha_rolling_update_node02
     1 | {13316822}                                                                                                                                                                   | HA  Development                        | ha_hawk_haproxy_node02_test
     1 | {13316842}

When also considering parallel_failed, the numbers double or triple depending on the number of jobs in the scenario, but I guess it is more useful to think in terms of scenarios (and thus only query for failed).
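For comparison, a variant of the query above that also counts parallel_failed results could look like this (a minimal sketch, not run against OSD; the table above only covers result = 'failed'):

  select count(jobs.id),
         (select name from job_groups where id = group_id) as group_name,
         (array_agg(test))[1] as example_test
    from jobs
    left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null
    where dependency = 2
      and t_finished >= '2024-01-20'
      and result in ('failed', 'parallel_failed')  -- widened result filter
      and test not like '%:investigate:%'
    group by group_id
    order by count(jobs.id) desc;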

But some of the failures do in fact look like the network within the SUTs was broken, e.g. https://openqa.suse.de/tests/13314878#step/patch_and_reboot/196. Considering we currently still disrupt the GRE network when booting or shutting down a worker, it is unfortunately expected that we see a small number of jobs like this. (Of course there are also failures which likely have a non-network-related cause, e.g. https://openqa.suse.de/tests/13310295#step/setup/54 and https://openqa.suse.de/tests/13294837#step/suseconnect_scc/37.)

Actions #5

Updated by mkittler 3 months ago · Edited

  • Status changed from In Progress to Feedback

Considering the absolute figures are not very high and the graph also looks good again, I would not make a big deal of it. It looks like there's no silence to remove.


I've got a response about the scenario: https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=mru-iscsi_client_normal_auth_backstore_fileio_dev&version=15-SP1

This has been running fine forever in the official job group:
https://openqa.suse.de/tests/13310783#next_previous
That should be some misconfigured remnant, as the patching happens in the parent image. It would be a good idea if you had a way to exclude the junk we have in dev groups :slightly_smiling_face:, but I can't put my hand in the fire that we don't have some forgotten test suite without a progress ticket there. We have a plan to clean that up for sure. I will create some tickets.

So we should probably just ignore jobs in development groups. MR for that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1098
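One conceivable shape for such a filter, assuming development job groups can be recognized by their name (the actual change in the linked MR may use a different criterion), would be to exclude them directly in the monitoring query:

  -- hypothetical exclusion of development job groups; the real MR may differ
  select count(*) as failed_mm_jobs
    from jobs
    join job_dependencies on (jobs.id = child_job_id or jobs.id = parent_job_id)
    left join job_groups on job_groups.id = jobs.group_id
    where dependency = 2
      and jobs.result = 'failed'
      -- keep only jobs whose group name does not look like a development group
      and (job_groups.name is null or job_groups.name not ilike '%development%');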

Not sure about the scenario https://openqa.suse.de/tests/13310295#step/setup/54. Considering the other jobs in the Next & Previous tab look good, we probably don't have to look into it. That was just a test job.

Actions #6

Updated by mkittler 3 months ago

  • Status changed from Feedback to In Progress
Actions #7

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by okurz 3 months ago · Edited

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1098 merged, deploying. Waiting for deployment and update of data in monitoring.

EDIT: We talked about the encountered failed jobs and agreed that https://openqa.suse.de/tests/13310295#step/setup/54 is a red herring, a test job triggered by emiura.

Actions #9

Updated by okurz 3 months ago

  • Tags changed from alert to alert, reactive work, multi-machine
Actions #10

Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved
Actions #11

Updated by okurz 3 months ago

  • Status changed from Resolved to In Progress

We forgot to look into the actual problem and have been hit by it again at exactly the same point in time, which is Sunday 03:00 CET in the morning when machines reboot:
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1706353426537&to=1706450852479&viewPanel=24

Actions #12

Updated by openqa_review 3 months ago

  • Due date set to 2024-02-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by okurz 3 months ago

As discussed in the daily, the problem is that we only consider a timeframe of 24 h, so it can happen that, for example, just 2 jobs were scheduled, leading to bad statistics. Instead we could switch to a fixed number of jobs, e.g. in monitoring/telegraf/telegraf-webui.conf go from

sqlquery="with mm_jobs as (select …) where t_created >= (select timezone('UTC', now()) - interval '24 hour') … as ratio_mm from mm_jobs group by mm_jobs.result"
sqlquery="with mm_jobs as (select …) where … as ratio_mm from mm_jobs group by mm_jobs.result ordery by id limit 1000"
Actions #15

Updated by mkittler 3 months ago · Edited

  • Status changed from In Progress to Feedback

The MR was merged. I'll keep the ticket in Feedback until next Monday. If the alert hasn't fired again by then, I'll resolve the ticket.

Actions #16

Updated by okurz 3 months ago

  • Related to action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added
Actions #17

Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

The alert didn't fire in the last 7 days so I'm resolving the ticket.

