action #114694

Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M

Added by okurz 2 months ago. Updated 6 days ago.

Status: Feedback
Priority: High
Assignee:
Target version:
Start date: 2022-07-26
Due date: 2022-09-23
% Done: 0%
Estimated time:
Description

Observation

Why does http://dashboard.qam.suse.de/incident/25171 show no aggregates?

I assume this is why https://build.suse.de/request/show/276375 wasn't yet approved by qam-openqa. But for example https://openqa.suse.de/admin/productlog?id=952035 mentions the incident so aggregate openQA jobs do exist. Also see
https://suse.slack.com/archives/C02AJ1E568M/p1658835230635849

Expected result

  • For every incident an entry should show up in https://dashboard.qam.suse.de
  • For every incident in https://dashboard.qam.suse.de both incident + aggregate tests are triggered
  • Results from incident + aggregate tests show up on the dashboard
  • If there is a non-zero number of related openQA jobs and none of them failed, then qem-bot approves the request in IBS

Acceptance criteria

Suggestions


Related issues

Related to QA - action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M (Resolved, 2022-04-28)

Related to openQA Project - action #109310: qem-bot/dashboard - mixed old and new incidents size:M (Resolved, 2022-03-31)

History

#1 Updated by osukup 2 months ago

From the logs everything looks OK, but in the database:

dashboard_db=# select count(*) from update_openqa_settings where product = 'SLES15SP3';
 count 
-------
   803
(1 row)

dashboard_db=# select count(*) from update_openqa_settings where product = 'SLES15SP4';
 count 
-------
     0
(1 row)

--> https://github.com/openSUSE/qem-bot/pull/54 - logs the result of the PUT request.
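For context, a minimal sketch of what logging the PUT result might look like. Names, the session interface, and the log format are assumptions for illustration; the actual change is in the linked PR:

```python
import logging

log = logging.getLogger("bot.dashboard")

def put_and_log(session, url, headers, payload):
    # Send the PUT and log its outcome. Function and argument names are
    # hypothetical, not the actual qem-bot code.
    res = session.put(url, headers=headers, json=payload)
    if res.status_code == 200:
        log.info("PUT %s -> %s", url, res.status_code)
    else:
        log.error("PUT %s failed -> %s: %s", url, res.status_code, res.text)
    return res
```

With such logging in place, a silently failing dashboard update would leave an error entry instead of disappearing without a trace.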

#2 Updated by osukup 2 months ago

I manually pushed the data to the database (using python3: requests.put(url, headers=token, json=data["qem"])) with data parsed from the GitLab log. All went OK, so the update of the dashboard database during the run probably ran into hidden problems.

#3 Updated by osukup 2 months ago

--> so we need to add a retry based on response status to the post_qem method
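The idea can be sketched as a simplified stand-alone retry loop (a stand-in for illustration; the real fix wires urllib3's Retry into the requests session, and the status list is an assumption):

```python
import time

# Transient statuses that are usually worth retrying; this set is an
# assumption, not the exact list qem-bot uses.
RETRIABLE_STATUSES = {408, 429, 500, 502, 503, 504}

def request_with_retries(do_request, attempts=3, backoff=1.0):
    # Call do_request() (any callable returning a response-like object
    # with a status_code) up to `attempts` times, sleeping with
    # exponential backoff between retriable failures.
    response = None
    for attempt in range(attempts):
        response = do_request()
        if response.status_code not in RETRIABLE_STATUSES:
            return response
        time.sleep(backoff * (2 ** attempt))
    return response
```

A transient 503 from the dashboard would then be retried instead of silently losing the update_settings data.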

#4 Updated by cdywan 2 months ago

  • Subject changed from Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists to Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by osukup 2 months ago

  • Assignee set to osukup

#6 Updated by osukup about 2 months ago

  • Status changed from Workable to In Progress

#7 Updated by openqa_review about 2 months ago

  • Due date set to 2022-08-12

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by osukup about 2 months ago

  • Status changed from In Progress to Feedback

Merged both changes -> retry is now used on all requests + the result of the PUT operation is logged.

#9 Updated by okurz about 2 months ago

  • Priority changed from High to Immediate

Since these changes are live I see that the "schedule" step fails repeatedly, e.g. in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1078520#L123

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
    **response_kw
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
    **response_kw
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
    **response_kw
  [Previous line repeated 2 more times]
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 807, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='download.suse.de', port=80): Max retries exceeded with url: /ibs/SUSE:/Maintenance:/18458/SUSE_Updates_SLE-Product-SLES_15-SP1-BCL_x86_64/repodata/repomd.xml (Caused by ResponseError('too many 404 error responses',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "./qem-bot/bot-ng.py", line 7, in <module>
    main()
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 41, in main
    sys.exit(cfg.func(cfg))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 24, in do_incident_schedule
    bot = OpenQABot(args)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/openqabot.py", line 23, in __init__
    self.incidents = get_incidents(self.token)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 42, in get_incidents
    xs.append(Incident(i))
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 70, in __init__
    self.revisions = self._rev(self.channels, self.project)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 95, in _rev
    max_rev = get_max_revision(lrepos, archver.arch, project)
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/repohash.py", line 46, in get_max_revision
    raise e
  File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/repohash.py", line 35, in get_max_revision
    root = ET.fromstring(requests.get(url).text)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 507, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPConnectionPool(host='download.suse.de', port=80): Max retries exceeded with url: /ibs/SUSE:/Maintenance:/18458/SUSE_Updates_SLE-Product-SLES_15-SP1-BCL_x86_64/repodata/repomd.xml (Caused by ResponseError('too many 404 error responses',))

Please check with urgency if this is a regression and fix it, or investigate the different, new failure. Until then I have disabled both schedule steps "updates only schedule" and "incidents only schedule" on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules.
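The traceback shows the retry logic exhausting all attempts on HTTP 404, which is a permanent error for a missing repomd.xml. A fix could draw the distinction sketched below (the concrete status list is an assumption, not the actual PR):

```python
# Transient server/throttling errors are worth retrying; client errors
# like 404 are treated as permanent so a missing file fails fast.
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}

def should_retry(status_code):
    # Fail fast on the 404 seen in the traceback above instead of
    # burning all retries on a permanently missing repomd.xml.
    return status_code in TRANSIENT_STATUSES
```

In urllib3 terms this corresponds to restricting status_forcelist to transient codes only.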

#11 Updated by cdywan about 2 months ago

osukup wrote:

https://github.com/openSUSE/qem-bot/pull/58

This was a very fast PR and review. You guys are awesome 😁️

#12 Updated by osukup about 2 months ago

Follow-up: https://github.com/openSUSE/qem-bot/pull/59

after I looked at how backoff_factor is used in urllib3
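As a reminder of the behavior in question, urllib3 1.x (the py3.6-era version in the traceback above) computes retry sleeps roughly as sketched below; this mirrors Retry.get_backoff_time() as of urllib3 1.26, and newer releases may differ:

```python
BACKOFF_MAX = 120.0  # urllib3 1.x caps individual sleeps at 120 seconds

def backoff_times(backoff_factor, retries):
    # Sleep times between retries: no sleep before the first retry,
    # then backoff_factor * 2**(n - 1) for the n-th consecutive error,
    # capped at BACKOFF_MAX.
    times = []
    for n in range(1, retries + 1):
        if n <= 1:
            times.append(0.0)
        else:
            times.append(min(backoff_factor * (2 ** (n - 1)), BACKOFF_MAX))
    return times
```

So with backoff_factor=1 and 4 retries the sleeps are 0, 2, 4, 8 seconds; a small factor keeps total wait time bounded.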

#13 Updated by okurz about 2 months ago

  • Priority changed from Immediate to High

"schedule incidents" passed in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1079141, re-enabled schedule for both incidents and aggregates

#14 Updated by mgrifalconi about 2 months ago

Hello, I see the ticket is in 'Feedback' state, but there are still several update requests with only a green incident box that are not being auto-approved: http://dashboard.qam.suse.de/blocked

A few examples:
http://dashboard.qam.suse.de/incident/24723
http://dashboard.qam.suse.de/incident/24743
http://dashboard.qam.suse.de/incident/24762

#15 Updated by jbaier_cz about 2 months ago

  • Related to action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M added

#16 Updated by jbaier_cz about 2 months ago

  • Related to action #109310: qem-bot/dashboard - mixed old and new incidents size:M added

#17 Updated by mgrifalconi about 2 months ago

The aggregates disappeared again a few minutes ago. Before that they were showing up correctly.
They disappeared just after every single aggregate turned green :(
Hope that was not the cause, since it does not happen often :P

#18 Updated by kraih about 2 months ago

I've made a small change to the dashboard so the journal is not flooded by HTTP request data anymore. Should make it easier to keep track of what data the dashboard cleans up (and rule out any possible regressions there). https://github.com/openSUSE/qem-dashboard/commit/1e6321f02b0c082f5659b24ef97898b24f248fcb

#19 Updated by cdywan about 2 months ago

Discussed briefly in the Unblock. It would be great if others take a look and review the current code for flaws, i.e. pretend it's a new pull request, and we can then see if that helps us find some gaps or ideas where to improve logging. At the latest we can discuss it in the mob session tomorrow, or comment here earlier.

#20 Updated by kraih about 2 months ago

kraih wrote:

I've made a small change to the dashboard so the journal is not flooded by HTTP request data anymore. Should make it easier to keep track of what data the dashboard cleans up (and rule out any possible regressions there). https://github.com/openSUSE/qem-dashboard/commit/1e6321f02b0c082f5659b24ef97898b24f248fcb

We've investigated this issue during the mob session today. A regression in the cleanup code has been ruled out: there are no log entries for the very recent incident 25413. We have tracked it down to a call of this code from qem-bot. The bot thinks the update_settings have been added to the database, but they do not appear to have ever been added. The next step will be to find out whether the problem is on the bot or the dashboard side.

#21 Updated by cdywan about 1 month ago

  • Due date changed from 2022-08-12 to 2022-08-26

Bumping the due date in accordance with ongoing research.

#22 Updated by osukup about 1 month ago

We added logging of the post ID to qem-bot; next we need another occurrence of the problem to analyze what is going on.

This needs cooperation with the openqa-qam reviewers (mgrifalconi?)

#23 Updated by okurz about 1 month ago

We do not have a good way to find "incidents with missing aggregate tests" programmatically, so we are relying on users to tell us about further suspicious cases. If anybody finds cases where they suspect there are missing aggregate tests, please tell us and we can take a look.
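Such a programmatic check could in principle be a small filter over dashboard data. A hedged sketch, assuming a simplified record shape rather than the real dashboard API schema:

```python
def incidents_missing_aggregates(incidents):
    # Return incident numbers that have incident test results but no
    # aggregate results. The record shape used here is a hypothetical
    # simplification, not the real dashboard API schema.
    return [
        i["number"]
        for i in incidents
        if i.get("incident_results") and not i.get("aggregate_results")
    ]
```

Run periodically, such a filter would surface cases like incident 25171 without waiting for reviewers to notice them.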

#25 Updated by cdywan about 1 month ago

  • Due date changed from 2022-08-26 to 2022-09-02

I guess this still needs to be validated

#26 Updated by kraih 28 days ago

I have a new suspicion where the problem could be, but to confirm it I need one concrete example of an incident with missing aggregate jobs from the past few days.

#27 Updated by cdywan 26 days ago

osukup wrote:

We added logging of post ID to qem-bot , and next we need another occasion of problem to analyze what is going on

https://github.com/openSUSE/qem-bot/pull/64/files

#28 Updated by cdywan 26 days ago

  • Due date changed from 2022-09-02 to 2022-09-23
  • Assignee changed from osukup to kraih

We discussed that we'll try to spot actual examples, which might be easier if some of us join UV temporarily (proposed elsewhere), make sure we confirm the problem, and Sebastian will ideally look into the fix.

#29 Updated by kraih 14 days ago

Since there's not been much progress recently, I'll look into a possible solution later this week.

#30 Updated by kraih 11 days ago

  • Status changed from Feedback to In Progress

#31 Updated by kraih 7 days ago

My suspicion turned out to be a definitive bug: aggregate jobs could be deleted by accident if update_settings are used for longer than 90 days. The solution I had in mind looks promising in preliminary tests, so I'll try to add some unit tests too and then deploy it to production.
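The bug described above amounts to a cleanup rule that only considered age. A safer rule (hedged sketch with hypothetical field names, not the actual qem-dashboard schema or fix) also checks whether the settings are still in use:

```python
from datetime import datetime, timedelta, timezone

def expired_update_settings(settings, now, max_age_days=90):
    # Expire update_settings only when they were both created AND last
    # used more than max_age_days ago, so long-lived settings that still
    # receive jobs keep their aggregate results. The record shape
    # ({"id", "created", "last_job"}) is an assumption for illustration.
    cutoff = now - timedelta(days=max_age_days)
    return [
        s["id"]
        for s in settings
        if s["created"] < cutoff
        and (s["last_job"] is None or s["last_job"] < cutoff)
    ]
```

Settings older than 90 days that received a job last week would then survive the cleanup instead of taking their aggregate jobs with them.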

#32 Updated by kraih 7 days ago

Committed a possible fix. Now we'll have to keep an eye on it again. https://github.com/openSUSE/qem-dashboard/commit/9cf6e655007fad2a366c9a9b4bf6f0f353de69fd

#33 Updated by kraih 6 days ago

  • Status changed from In Progress to Feedback

The change has been deployed via pipeline 10 hours ago and is now in production.
