action #114694

closed

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #117694: [epic] Stable and reliable qem-bot

Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M

Added by okurz over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Start date:
2022-07-26
Due date:
% Done:

0%

Estimated time:

Description

Observation

Why does http://dashboard.qam.suse.de/incident/25171 show no aggregates?

I assume this is why https://build.suse.de/request/show/276375 wasn't yet approved by qam-openqa. But for example https://openqa.suse.de/admin/productlog?id=952035 mentions the incident so aggregate openQA jobs do exist. Also see
https://suse.slack.com/archives/C02AJ1E568M/p1658835230635849

Expected result

  • For every incident an entry should show up in https://dashboard.qam.suse.de
  • For every incident in https://dashboard.qam.suse.de, incident + aggregate tests are triggered
  • Results from incident + aggregate tests show up on the dashboard
  • If there is a non-zero number of related openQA jobs and none of them failed, then qem-bot approves in IBS

Acceptance criteria

  • AC1: There are no more aggregate jobs missing for new incidents

Suggestions


Related issues 3 (0 open, 3 closed)

Related to QA (public) - action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M (Resolved, kraih, 2022-04-28)

Related to openQA Project (public) - action #109310: qem-bot/dashboard - mixed old and new incidents size:M (Resolved, kraih, 2022-03-31)

Related to QA (public) - action #117619: Bot approved update request with failing tests size:M (Resolved, tinita)

Actions #1

Updated by osukup over 2 years ago

From the logs everything looks OK, but in the database:

dashboard_db=# select count(*) from update_openqa_settings where product = 'SLES15SP3';
 count 
-------
   803
(1 row)

dashboard_db=# select count(*) from update_openqa_settings where product = 'SLES15SP4';
 count 
-------
     0
(1 row)

--> https://github.com/openSUSE/qem-bot/pull/54 - for logging the result of the PUT request.

Actions #2

Updated by osukup over 2 years ago

Manually pushed the data to the database (using python3: requests.put(url, headers=token, json=data["qem"])) with data parsed from the GitLab log. Everything went OK, so the update of the dashboard database during the run probably ran into hidden problems.
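
For reference, a minimal sketch of such a manual re-push; the endpoint URL, the token header name and the payload layout are placeholders here, not taken from qem-bot's code:

import json
import requests

# Hypothetical values: the real endpoint and token are whatever qem-bot itself uses.
url = "http://dashboard.qam.suse.de/api/..."   # dashboard endpoint for update settings
token = {"Authorization": "Token XXXXXXXX"}    # dashboard API token

# Data recovered by parsing the GitLab pipeline log of the affected run.
with open("parsed_from_gitlab_log.json") as f:
    data = json.load(f)

res = requests.put(url, headers=token, json=data["qem"])
# Logging the response is exactly what the bot was not doing at the time (see PR #54).
print(res.status_code, res.text)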

Actions #3

Updated by osukup over 2 years ago

--> so we need to add a retry based on the response status to the post_qem method
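
A sketch of the usual way to do that with requests, by mounting a urllib3 Retry on the session; the concrete status codes and retry counts here are illustrative, not the values the later qem-bot pull requests ended up using:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=5,                                 # overall retry budget
    backoff_factor=1,                        # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],   # also retry on these HTTP status codes
    allowed_methods=["GET", "PUT", "POST"],  # called method_whitelist in older urllib3
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

# post_qem would then send its request through this session, e.g.:
# res = session.put(url, headers=token, json=payload)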

Actions #4

Updated by livdywan over 2 years ago

  • Subject changed from Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists to Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by osukup over 2 years ago

  • Assignee set to osukup
Actions #6

Updated by osukup over 2 years ago

  • Status changed from Workable to In Progress
Actions #7

Updated by openqa_review over 2 years ago

  • Due date set to 2022-08-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by osukup over 2 years ago

  • Status changed from In Progress to Feedback

Merged both changes -> retries are now used on all requests + the result of the PUT operation is logged

Actions #9

Updated by okurz over 2 years ago

  • Priority changed from High to Immediate

Since these changes are live I see that the "schedule" step fails repeatedly, e.g. in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1078520#L123

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
    **response_kw
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
    **response_kw
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
    **response_kw
  [Previous line repeated 2 more times]
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 807, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='download.suse.de', port=80): Max retries exceeded with url: /ibs/SUSE:/Maintenance:/18458/SUSE_Updates_SLE-Product-SLES_15-SP1-BCL_x86_64/repodata/repomd.xml (Caused by ResponseError('too many 404 error responses',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 7, in <module>
    main()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 41, in main
    sys.exit(cfg.func(cfg))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 24, in do_incident_schedule
    bot = OpenQABot(args)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/openqabot.py", line 23, in __init__
    self.incidents = get_incidents(self.token)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 42, in get_incidents
    xs.append(Incident(i))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 70, in __init__
    self.revisions = self._rev(self.channels, self.project)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 95, in _rev
    max_rev = get_max_revision(lrepos, archver.arch, project)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/repohash.py", line 46, in get_max_revision
    raise e
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/repohash.py", line 35, in get_max_revision
    root = ET.fromstring(requests.get(url).text)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 507, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPConnectionPool(host='download.suse.de', port=80): Max retries exceeded with url: /ibs/SUSE:/Maintenance:/18458/SUSE_Updates_SLE-Product-SLES_15-SP1-BCL_x86_64/repodata/repomd.xml (Caused by ResponseError('too many 404 error responses',))

Please check with urgency if this is a regression and fix it, or investigate the different, new failure. Until then I have disabled both schedule steps, "updates only schedule" and "incidents only schedule", on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules.

Actions #11

Updated by livdywan over 2 years ago

osukup wrote:

https://github.com/openSUSE/qem-bot/pull/58

This was a very fast PR and review. You guys are awesome 😁️

Actions #12

Updated by osukup over 2 years ago

Follow-up: https://github.com/openSUSE/qem-bot/pull/59, after I looked at how backoff_factor is used in urllib3.
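
For reference, this is roughly how urllib3 1.x turns backoff_factor into a sleep between retries (details differ slightly in newer urllib3 versions):

# Simplified reimplementation of urllib3 1.x Retry.get_backoff_time():
#   no sleep before the first retry, then backoff_factor * 2**(consecutive_errors - 1),
#   capped at Retry.BACKOFF_MAX (120 seconds by default).
BACKOFF_MAX = 120

def backoff_time(backoff_factor, consecutive_errors):
    if consecutive_errors <= 1:
        return 0
    return min(BACKOFF_MAX, backoff_factor * (2 ** (consecutive_errors - 1)))

# With backoff_factor=1 the pauses are roughly 0, 2, 4, 8, 16, ... seconds.
print([backoff_time(1, n) for n in range(1, 7)])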

Actions #13

Updated by okurz over 2 years ago

  • Priority changed from Immediate to High

"schedule incidents" passed in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1079141, re-enabled schedule for both incidents and aggregates

Actions #14

Updated by mgrifalconi over 2 years ago

Hello, I see the ticket is in 'Feedback' state, but there are still several update requests with only an incident green box that are not being auto-approved. http://dashboard.qam.suse.de/blocked

A few examples:
http://dashboard.qam.suse.de/incident/24723
http://dashboard.qam.suse.de/incident/24743
http://dashboard.qam.suse.de/incident/24762

Actions #15

Updated by jbaier_cz over 2 years ago

  • Related to action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M added
Actions #16

Updated by jbaier_cz over 2 years ago

  • Related to action #109310: qem-bot/dashboard - mixed old and new incidents size:M added
Actions #17

Updated by mgrifalconi over 2 years ago

Aggregates disappeared again, a few minutes ago. Before that they were showing up correctly.
They disappeared just after every single aggregate got green :(
Hope that this was not the cause, since it does not happen often :P

Actions #18

Updated by kraih over 2 years ago

I've made a small change to the dashboard so the journal is not flooded by HTTP request data anymore. Should make it easier to keep track of what data the dashboard cleans up (and rule out any possible regressions there). https://github.com/openSUSE/qem-dashboard/commit/1e6321f02b0c082f5659b24ef97898b24f248fcb

Actions #19

Updated by livdywan over 2 years ago

Discussed briefly in the Unblock. It would be great if others take a look and review the current code for flaws, i.e. pretend it's a new pull request, and we can then see if that helps us find some gaps or ideas where to improve logging. At the latest we can discuss it tomorrow in the mob session, or comment here earlier.

Actions #20

Updated by kraih over 2 years ago

kraih wrote:

I've made a small change to the dashboard so the journal is not flooded by HTTP request data anymore. Should make it easier to keep track of what data the dashboard cleans up (and rule out any possible regressions there). https://github.com/openSUSE/qem-dashboard/commit/1e6321f02b0c082f5659b24ef97898b24f248fcb

We've investigated this issue during the mob session today. A regression in the cleanup code has been ruled out, there are no log entries for the very recent incident 25413. We have tracked it down to a call of this code from the qem-bot. The bot thinks the update_settings have been added to the database, but they do not appear to have ever been added. The next step will be to find out if the problem is on the bot or dashboard side.

Actions #21

Updated by livdywan over 2 years ago

  • Due date changed from 2022-08-12 to 2022-08-26

Bumping the due date in accordance with ongoing research.

Actions #22

Updated by osukup over 2 years ago

We added logging of the post ID to qem-bot, and next we need another occurrence of the problem to analyze what is going on.

This needs cooperation with the openqa-qam reviewers (@mgrifalconi?)

Actions #23

Updated by okurz over 2 years ago

We do not have a good way to find "incidents with missing aggregate tests" programmatically, so we are relying on users to tell us about further suspicious cases. If anybody finds cases where you assume there are missing aggregate tests, please tell us and we can look into it.
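
A loosely sketched idea for such a programmatic check against the dashboard database; update_openqa_settings and incident_in_update are mentioned elsewhere in this ticket, but the incidents table and the join columns below are assumptions about the qem-dashboard schema:

-- Hypothetical query: active incidents without any linked aggregate (update) settings.
-- Everything except the update_openqa_settings and incident_in_update table names is a guess.
SELECT i.number
  FROM incidents i
 WHERE i.active
   AND NOT EXISTS (
         SELECT 1
           FROM incident_in_update iu
           JOIN update_openqa_settings us ON us.id = iu.settings
          WHERE iu.incident = i.id);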

Actions #25

Updated by livdywan over 2 years ago

  • Due date changed from 2022-08-26 to 2022-09-02

I guess this still needs to be validated

Actions #26

Updated by kraih over 2 years ago

I have a new suspicion where the problem could be, but to confirm that I need one concrete example of an incident with missing aggregate jobs from the past few days.

Actions #27

Updated by livdywan over 2 years ago

osukup wrote:

We added logging of the post ID to qem-bot, and next we need another occurrence of the problem to analyze what is going on.

https://github.com/openSUSE/qem-bot/pull/64/files

Actions #28

Updated by livdywan over 2 years ago

  • Due date changed from 2022-09-02 to 2022-09-23
  • Assignee changed from osukup to kraih

We discussed that we'll try to spot actual examples, which might be easier if some of us join UV temporarily (proposed elsewhere), make sure we confirm the problem, and Sebastian will ideally look into the fix.

Actions #29

Updated by kraih over 2 years ago

Since there's not been much progress recently, I'll look into a possible solution later this week.

Actions #30

Updated by kraih over 2 years ago

  • Status changed from Feedback to In Progress
Actions #31

Updated by kraih over 2 years ago

My suspicion turned out to be an actual bug: aggregate jobs could be deleted by accident if update_settings are in use for longer than 90 days. The solution I had in mind looks promising in preliminary tests, so I'll try to add some unit tests too and then deploy it to production.

Actions #32

Updated by kraih over 2 years ago

Committed a possible fix. Now we'll have to keep an eye on it again. https://github.com/openSUSE/qem-dashboard/commit/9cf6e655007fad2a366c9a9b4bf6f0f353de69fd

Actions #33

Updated by kraih over 2 years ago

  • Status changed from In Progress to Feedback

The change has been deployed via pipeline 10 hours ago and is now in production.

Actions #34

Updated by okurz about 2 years ago

  • Due date changed from 2022-09-23 to 2022-10-14

Regarding the ACs I tried select * from job_settings where key ~ 'TEST_REPOS' and value ~ '25171' limit 10; but have not found any results. I wonder, can we re-add results so that they show up where they should?

Actions #35

Updated by jbaier_cz about 2 years ago

  • Related to action #117619: Bot approved update request with failing tests size:M added
Actions #36

Updated by okurz about 2 years ago

  • Parent task set to #91646
Actions #37

Updated by okurz about 2 years ago

  • Parent task changed from #91646 to #117694
Actions #38

Updated by kraih about 2 years ago

okurz wrote:

Regarding the ACs I tried select * from job_settings where key ~ 'TEST_REPOS' and value ~ '25171' limit 10; but have not found any results. I wonder, can we re-add results so that they show up where they should?

That data would have to come from the bot, not sure if it is possible. Probably not.

Actions #39

Updated by kraih about 2 years ago

  • Description updated (diff)
Actions #40

Updated by kraih about 2 years ago

I've simplified the ACs, since bringing back old data would be very hard, if it is possible at all.

Actions #41

Updated by kraih about 2 years ago

  • Status changed from Feedback to In Progress
Actions #42

Updated by kraih about 2 years ago

I went through the incident data currently in the dashboard and it mostly looks fine. There are a few suspicious ones in the blocked list; one was new enough to have been created after the fix was deployed. Incident 26186 (Smelt) had no data at all yet. From the bot logs in the pipelines it looked like jobs were being scheduled though:

INFO: not scheduling: Flavor: Leap-DVD-Incidents, version: 15.3 incident: 26186 , arch: x86_64  - exists in openQA
 {'approved': False,
  'channels': ['SUSE:Updates:SLE-Manager-Tools:15:x86_64',
               'SUSE:Updates:SLE-Manager-Tools:15:s390x',
               'SUSE:Updates:SLE-Manager-Tools:15:ppc64le',
               'SUSE:Updates:SLE-Manager-Tools:15:aarch64',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:aarch64',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:ppc64le',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:s390x',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:x86_64',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:x86_64',
               'SUSE:Updates:openSUSE-SLE:15.4',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:s390x',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:ppc64le',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:aarch64',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:x86_64',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:s390x',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:ppc64le',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:aarch64',
               'SUSE:SLE-15-SP1:Update',
               'SUSE:Updates:Storage:6:x86_64',
               'SUSE:Updates:Storage:6:aarch64',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:x86_64',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:s390x',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:ppc64le',
               'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:aarch64',
               'SUSE:Updates:openSUSE-SLE:15.3',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:aarch64',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:ppc64le',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:s390x',
               'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:x86_64'],
  'emu': False,
  'inReview': True,
  'inReviewQAM': True,
  'isActive': True,
  'number': 26186,
  'packages': ['golang-github-prometheus-alertmanager'],
  'project': 'SUSE:Maintenance:26186',
  'rr_number': 281776}
INFO: Inc 26186 does not have any aggregates settings
...
INFO: Inc 26186 has failed job in incidents

But after speaking with Ondrej I found out that the missing results in the dashboard are intentional. Jobs are scheduled for the products openSUSE-SLE:15.3/15.4, but both are still in the development folder, and therefore all openQA test results are ignored by the bot and never reach the dashboard. This condition should probably be highlighted better, because the reviewers were also not aware of it. But so far it looks like we don't have any missing aggregate job results since the fix has been deployed.

Actions #43

Updated by kraih about 2 years ago

  • Status changed from In Progress to Resolved

After talking to more people I'm ready to call this resolved. As a follow-up I've suggested to Ondrej that we add some kind of indicator in the dashboard when scheduled jobs do exist in openQA but are ignored because the product is in a development group, because those will likely result in false-positive bug reports in the future. So if anyone is revisiting this in the future, make sure to check that first!

Actions #44

Updated by kraih about 2 years ago

  • Status changed from Resolved to In Progress

Not resolved after all; today we are missing many aggregate jobs. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1187105

ERROR: {"error":"Referenced update settings (49294) do not exist"}
ERROR: {"error":"Referenced update settings (49189) do not exist"}
ERROR: {"error":"Referenced update settings (49412) do not exist"}
...
Actions #45

Updated by kraih about 2 years ago

I will be disabling the cleanup feature completely now.

Actions #47

Updated by kraih about 2 years ago

The next wave of aggregate jobs has arrived on the dashboard and it looks good again. Now we have to wait and see again. If something goes wrong the new logging will help us locate the problem.

Actions #48

Updated by livdywan about 2 years ago

  • Due date changed from 2022-10-14 to 2022-10-21
  • Status changed from In Progress to Feedback

I take it that means waiting another week

Actions #49

Updated by kraih about 2 years ago

So far the data in the dashboard is looking good. If it stays that way I'll permanently remove the cleanup feature and call the issue resolved.

Actions #50

Updated by kraih about 2 years ago

And we've had another case today: pretty much all aggregate jobs vanished from the dashboard at once. Reviewing the logs and database now.

Actions #51

Updated by kraih about 2 years ago

  • Status changed from Feedback to In Progress
Actions #52

Updated by kraih about 2 years ago

Quite interesting results in the logs. The data must have vanished between 11:00 and 15:00. With the cleanup feature disabled, only the incident reuse feature can now delete update settings. And there was only one case today:

Oct 17 12:14:18 qam2 dashboard[26091]: [26091] [i] Cleaning up old jobs for incident 26179, rr_number change: 282072 -> 282453
Oct 17 12:14:18 qam2 dashboard[26091]: [26091] [i] Update settings cleaned up for incident 26179: 49611, 49593, 49495, 49703, 49471, 49479, 49492, 49689, 49708, 49853, 49712, 49706, 49806, 49489, 49488, 49630,
49582, 49613, 49698, 49815, 49756, 49800, 49744, 49602, 49587, 49646, 49502, 49486, 49757, 49581, 49583,
49580, 49817, 49812, 49604, 49588, 49485, 49854, 49645, 49754, 49704, 49612, 49422, 49600, 49594, 49743, 49524, 49821, 49589, 49476, 49595, 49537, 49522, 49523,
49831, 49824, 49692, 49813, 49820, 49427, 49591, 49536, 49825, 49707, 49851, 49482, 49597, 49690, 49713, 49532, 49490, 49710, 49585, 49818, 49810, 49474, 49424,
49809, 49752, 49701, 49501, 49481, 49816, 49642, 49814, 49525, 49426, 49421, 49702, 49801, 49478, 49472, 49586, 49477, 49832, 49491, 49521, 49751, 49811,
49695, 49822, 49740, 49592, 49469, 49644, 49855, 49590, 49850, 49470, 49802, 49601, 49535, 49694, 49711, 49709, 49531, 49741, 49714, 49631, 49584, 49803, 49723,
49635, 49697, 49520, 49693, 49691, 49596, 49641, 49473, 49425, 49699, 49634, 49603, 49715, 49534, 49487, 49805, 49696, 49480, 49833, 49633, 49599, 49742, 49598,
49722, 49705, 49721, 49494, 49605, 49755, 49799, 49503, 49852, 49745, 49493, 49804, 49647, 49700, 49632, 49579, 49483, 49475, 49823, 49807, 49808, 49484,
49819

So the only remaining explanation is that this incident reuse caused a cascading delete of update settings and their jobs that was large enough to clear the entire blocked list. That is certainly unexpected, but it opens up a new option for a possible fix: instead of deleting all the associated data, we could also just sever the link between incident number and update settings in the incident_in_update table.
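
Roughly the difference between the two approaches in SQL terms; the table names incident_in_update and update_openqa_settings appear in this ticket, but the column names and the cascading foreign keys are assumptions about the qem-dashboard schema:

-- Old behaviour (sketch): reusing an incident deleted the update settings themselves,
-- cascading to the aggregate openQA jobs that reference them.
DELETE FROM update_openqa_settings WHERE id IN (49611, 49593 /* , ... */);

-- Proposed fix (sketch): only sever the incident <-> update settings link,
-- leaving the settings and their jobs untouched for everything else.
DELETE FROM incident_in_update
 WHERE incident = 26179 AND settings IN (49611, 49593 /* , ... */);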

Actions #53

Updated by kraih about 2 years ago

I've deployed a new fix that only removes the link between incident number and update settings, but does not remove the settings/jobs anymore. https://github.com/openSUSE/qem-dashboard/commit/e6c1c8914458a3e92055d09be148bf65e70c8793

Actions #54

Updated by kraih about 2 years ago

  • Status changed from In Progress to Feedback

Back to observing.

Actions #55

Updated by kraih about 2 years ago

Feedback from test reviewers has been good so far. I'm not going to call this resolved yet, since I don't want to jinx it again. :)

Actions #56

Updated by livdywan about 2 years ago

  • Due date changed from 2022-10-21 to 2022-10-28

I guess we can wait til the end of the week

Actions #57

Updated by kraih about 2 years ago

cdywan wrote:

I guess we can wait til the end of the week

Or someone other than me sets it to resolved. ;)

Actions #58

Updated by kraih about 2 years ago

The data vanishing today was not related to this ticket; it was just a SMELT outage.

Actions #59

Updated by okurz about 2 years ago

  • Due date deleted (2022-10-28)
  • Status changed from Feedback to Resolved

As no easy way to quickly check for incidents with possibly missing aggregate tests has been mentioned, I checked the results on http://dashboard.qam.suse.de/blocked manually. I found multiple cases where there are aggregate tests but no incident tests. Then I found http://dashboard.qam.suse.de/incident/26632 for kubevirt, but it was created just today, so there are incident tests and the request is simply not yet included in any aggregate test build; that would come tonight. Other examples are e.g. https://smelt.suse.de/incident/26576/ about "python-pylint", which has no tests linked at all. For those cases I have created https://gitlab.suse.de/tools/smelt/-/issues/924 but also reopened #99072, because there should be Leap-related openQA tests but there aren't, hence the approval is blocked. So in conclusion I did not find any further cases where there are incident tests but no aggregate tests, hence resolving.
