action #154021

closed

[alert] Ratio of not restarted multi-machine tests by result

Added by tinita 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-01-22
Due date: 2024-02-12
% Done: 0%
Estimated time:
Description

Observation

https://stats.openqa-monitor.qa.suse.de/alerting/grafana/0XohcmfVk/view?orgId=1

Date: Sun, 21 Jan 2024 21:31:37 +0100
2 firing alert instances
Firing [stats.openqa-monitor.qa.suse.de]
Ratio of not restarted multi-machine tests by result alert

Values: A0=50
Labels:
  alertname: Ratio of not restarted multi-machine tests by result alert
  grafana_folder: Salt
  rule_uid: 0XohcmfVk
Annotations:
  message: Investigation hints:

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M (Resolved, jbaier_cz, 2024-01-30)

Actions #1

Updated by tinita 3 months ago

  • Description updated (diff)
Actions #2

Updated by tinita 3 months ago

  • Description updated (diff)
Actions #3

Updated by mkittler 3 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Looks like this is good again, but we have a gap in the data and, after that gap, a value of 100 % fail rate (50 % failed + 50 % parallel failed), which looks very weird: https://stats.openqa-monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=24&orgId=1&from=1705754659644&to=1706010204892

Actions #4

Updated by mkittler 3 months ago · Edited

When looking into failed jobs with parallel dependencies since Saturday it doesn't look too bad in absolute figures:

openqa=> select distinct count(jobs.id), array_agg(jobs.id), (select name from job_groups where id = group_id), (array_agg(test))[1] as example_test from jobs left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null where dependency = 2 and t_finished >= '2024-01-20' and result in ('failed') and test not like '%:investigate:%' group by group_id order by count(jobs.id) desc;
 count |                                                                                  array_agg                                                                                   |                  name                  |                   example_test                    
-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------+---------------------------------------------------
    19 | {13314890,13302354,13314880,13314892,13314888,13302360,13302352,13302350,13302310,13314874,13314878,13314882,13302337,13302339,13302362,13302356,13314870,13302341,13314872} | YaST Maintenance Updates - Development | mru-iscsi_client_normal_auth_backstore_fileio_dev
     5 | {13310295,13310164,13310164,13310295,13308014}                                                                                                                               | Maintenance: SLE 15 SP5 HA Incidents   | qam_ha_rolling_upgrade_migration_supportserver
     3 | {13302400,13306361,13302398}                                                                                                                                                 | YaST Maintenance Updates               | mru-iscsi_server_normal_auth_backstore_hdd
     3 | {13311675,13311784,13311713}                                                                                                                                                 | Maintenance - QR - SLE15SP5-SAP        | ha_qdevice_node2
     2 | {13294837,13297632}                                                                                                                                                          | Maintenance: SLE 15 SP4 HA Incidents   | qam_ha_rolling_update_node02
     2 | {13299897,13312377}                                                                                                                                                          | JeOS: Development                      | jeos-nfs-client
     2 | {13314918,13309818}                                                                                                                                                          | Test Security                          | cc_ipsec_client
     1 | {13311651}                                                                                                                                                                   | Maintenance - QR - SLE15SP5-Security   | fips_env_stunnel_client
     1 | {13315685}                                                                                                                                                                   | Maintenance: SLE 15 SP2 HA Incidents   | qam_ha_rolling_update_node02
     1 | {13316822}                                                                                                                                                                   | HA  Development                        | ha_hawk_haproxy_node02_test
     1 | {13316842}

When also considering parallel_failed, the numbers double or triple depending on the number of jobs in the scenario, but I guess it is more useful to think in terms of scenarios (and thus only query for failed).
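For comparison, a variant of the query above that also counts parallel_failed results could look like this (a minimal sketch, not run against OSD; the table above only covers result = 'failed'):

  select count(jobs.id),
         (select name from job_groups where id = group_id) as group_name,
         (array_agg(test))[1] as example_test
    from jobs
    left join job_dependencies on (id = child_job_id or id = parent_job_id) and clone_id is null
    where dependency = 2
      and t_finished >= '2024-01-20'
      and result in ('failed', 'parallel_failed')  -- widened result filter
      and test not like '%:investigate:%'
    group by group_id
    order by count(jobs.id) desc;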

But some of the failures do in fact look like the network within the SUTs was broken, e.g. https://openqa.suse.de/tests/13314878#step/patch_and_reboot/196. Considering we currently still disrupt the GRE network when booting or shutting down a worker, it is unfortunately expected that we see a small number of jobs like this. (Of course there are also failures which likely have a non-network-related cause, e.g. https://openqa.suse.de/tests/13310295#step/setup/54 and https://openqa.suse.de/tests/13294837#step/suseconnect_scc/37.)

Actions #5

Updated by mkittler 3 months ago · Edited

  • Status changed from In Progress to Feedback

Considering the absolute figures are not very high and the graph also looks good again, I would not make a big deal of it. It looks like there's no silence to remove.


I've got a response about the scenario: https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Updates&machine=64bit&test=mru-iscsi_client_normal_auth_backstore_fileio_dev&version=15-SP1

This has been running fine forever in the official job group:
https://openqa.suse.de/tests/13310783#next_previous
That should be some misconfigured remnant, as the patching happens in the parent image. It would be a good idea if you had a way to exclude the junk we have in dev groups :slightly_smiling_face:, but I can't put my hand in the fire that we don't have some forgotten test suite without a progress ticket there. We have a plan to clean that up for sure. I will create some tickets.

So we should probably just ignore jobs in development groups. MR for that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1098
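One conceivable shape for such a filter, assuming development job groups can be recognized by their name (the actual change in the linked MR may use a different criterion), would be to exclude them directly in the monitoring query:

  -- hypothetical exclusion of development job groups; the real MR may differ
  select count(*) as failed_mm_jobs
    from jobs
    join job_dependencies on (jobs.id = child_job_id or jobs.id = parent_job_id)
    left join job_groups on job_groups.id = jobs.group_id
    where dependency = 2
      and jobs.result = 'failed'
      -- keep only jobs whose group name does not look like a development group
      and (job_groups.name is null or job_groups.name not ilike '%development%');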

Not sure about the scenario https://openqa.suse.de/tests/13310295#step/setup/54. Considering the other jobs in the Next & Previous tab look good, we probably don't have to look into it. That was just a test job.

Actions #6

Updated by mkittler 3 months ago

  • Status changed from Feedback to In Progress
Actions #7

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by okurz 3 months ago · Edited

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1098 merged, deploying. Waiting for deployment and update of data in monitoring.

EDIT: We talked about the encountered failed jobs and agreed that https://openqa.suse.de/tests/13310295#step/setup/54 is a red herring, a test job triggered by emiura.

Actions #9

Updated by okurz 3 months ago

  • Tags changed from alert to alert, reactive work, multi-machine
Actions #10

Updated by okurz 3 months ago

  • Status changed from Feedback to Resolved
Actions #11

Updated by okurz 3 months ago

  • Status changed from Resolved to In Progress

We forgot to look into the actual problem and have been hit by it again at exactly the same point in time, which is Sunday 03:00 CET in the morning when machines reboot:
https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?orgId=1&from=1706353426537&to=1706450852479&viewPanel=24

Actions #12

Updated by openqa_review 3 months ago

  • Due date set to 2024-02-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by okurz 3 months ago

As discussed in the daily, the problem is that we only consider a timeframe of 24 h, so it can happen that, for example, just 2 jobs were scheduled, leading to bad statistics. Instead we could switch to a fixed number of jobs, e.g. in monitoring/telegraf/telegraf-webui.conf go from

sqlquery="with mm_jobs as (select …) where t_created >= (select timezone('UTC', now()) - interval '24 hour') … as ratio_mm from mm_jobs group by mm_jobs.result"
sqlquery="with mm_jobs as (select …) where … as ratio_mm from mm_jobs group by mm_jobs.result ordery by id limit 1000"
Actions #15

Updated by mkittler 3 months ago · Edited

  • Status changed from In Progress to Feedback

The MR was merged. I'll keep the ticket in Feedback until next Monday. If the alert hasn't fired again by then, I'll resolve the ticket.

Actions #16

Updated by okurz 3 months ago

  • Related to action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added
Actions #17

Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

The alert didn't fire in the last 7 days so I'm resolving the ticket.

