action #114694
coordination #91646 (closed): [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release
coordination #117694: [epic] Stable and reliable qem-bot
Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M
Description
Observation
Why does http://dashboard.qam.suse.de/incident/25171 show no aggregates?
I assume this is why https://build.suse.de/request/show/276375 wasn't yet approved by qam-openqa. But for example https://openqa.suse.de/admin/productlog?id=952035 mentions the incident so aggregate openQA jobs do exist. Also see
https://suse.slack.com/archives/C02AJ1E568M/p1658835230635849
Expected result
- For every incident an entry should show up in https://dashboard.qam.suse.de
- For every incident in https://dashboard.qam.suse.de both incident and aggregate tests are triggered
- Results from incident + aggregate tests show up on the dashboard
- If there is a non-zero number of related openQA jobs and none of them failed, then qem-bot approves in IBS
Acceptance criteria
- AC1: There are no more aggregate jobs missing for new incidents
Suggestions
- See if this could be a regression from recent clean-up (https://github.com/openSUSE/qem-dashboard/pull/78, https://github.com/openSUSE/qem-dashboard/pull/63)
- Check existing logs for "Cleaning up old jobs for incident..." messages related to the incident (https://github.com/openSUSE/qem-dashboard/blob/af4e1672993265709f9d97a670d5653b9aef8903/lib/Dashboard/Model/Incidents.pm#L266)
- Dashboard runs as service "dashboard" on qam2.suse.de; the journal currently only contains about one day of data due to heavy HTTP request logging and a small journal size (see the sketch after this list for one way to grep it)
- Maybe add more logging for cleanups to help with debugging similar cases in the future
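A minimal sketch of how such a check could look, assuming plain journalctl access to the "dashboard" unit on qam2.suse.de (the unit name is taken from the item above, everything else is illustrative):

import subprocess
import sys

# Incident number to look for; defaults to the one from the observation above
incident = sys.argv[1] if len(sys.argv) > 1 else "25171"

# Read whatever the (short) journal still holds for the "dashboard" unit
out = subprocess.run(
    ["journalctl", "-u", "dashboard", "--no-pager", "--output=cat"],
    check=True, stdout=subprocess.PIPE, universal_newlines=True,
).stdout

for line in out.splitlines():
    if "Cleaning up old jobs for incident" in line and incident in line:
        print(line)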
Updated by osukup over 2 years ago
From the logs everything looks OK, but in the database:
dashboard_db=# select count(*) from update_openqa_settings where product = 'SLES15SP3';
count
-------
803
(1 row)
dashboard_db=# select count(*) from update_openqa_settings where product = 'SLES15SP4';
count
-------
0
(1 row)
--> https://github.com/openSUSE/qem-bot/pull/54 - for logging the result of the PUT request
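For illustration, a minimal sketch of what logging the PUT result on the bot side could look like (this is not the content of the PR above; the function signature, logger name and error handling are assumptions):

import logging
import requests

log = logging.getLogger("qem-bot")

def post_qem(data, token, url):
    # Push settings to the dashboard and log what actually came back, so a
    # silently failed write (like the empty SLES15SP4 table above) shows up in the logs.
    response = requests.put(url, headers=token, json=data)
    log.info("PUT %s -> %s: %s", url, response.status_code, response.text[:200])
    response.raise_for_status()
    return response.json()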
Updated by osukup over 2 years ago
Manually pushed the data to the database (using requests.put(url, headers=token, json=data["qem"]) from a python3 prompt) with data parsed from the GitLab log. Everything went OK, so the update of the dashboard database during the bot run probably ran into hidden problems.
Updated by osukup over 2 years ago
--> so we need to add a retry based on the response status to the post_qem method
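A minimal sketch of the direction described above: mount urllib3's Retry on a requests Session so dashboard requests are retried on transient server errors (retry counts, status codes and the commented-out URL are assumptions, not the actual qem-bot change):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=3, backoff_factor=1.0):
    # Retry only on transient 5xx responses; retrying on every 404 would hurt
    # callers that probe URLs which are expected to be missing.
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=(500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

session = make_session()
# session.put("http://dashboard.qam.suse.de/api/...", headers=token, json=data)

PUT is already in urllib3's default list of retryable methods, so no method whitelist is needed for this sketch.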
Updated by livdywan over 2 years ago
- Subject changed from Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists to Incident seems to have missing aggregate test results in qem-dashboard but openQA jobs exists size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by osukup over 2 years ago
- Status changed from Workable to In Progress
Updated by openqa_review over 2 years ago
- Due date set to 2022-08-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by osukup over 2 years ago
- Status changed from In Progress to Feedback
Merged both changes -> retries are now used on all requests + the result of the PUT operation is logged
Updated by okurz over 2 years ago
- Priority changed from High to Immediate
Since these changes are live I see that the "schedule" step fails repeatedly, e.g. in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1078520#L123
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
**response_kw
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
**response_kw
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 830, in urlopen
**response_kw
[Previous line repeated 2 more times]
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 807, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='download.suse.de', port=80): Max retries exceeded with url: /ibs/SUSE:/Maintenance:/18458/SUSE_Updates_SLE-Product-SLES_15-SP1-BCL_x86_64/repodata/repomd.xml (Caused by ResponseError('too many 404 error responses',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./qem-bot/bot-ng.py", line 7, in <module>
main()
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/main.py", line 41, in main
sys.exit(cfg.func(cfg))
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/args.py", line 24, in do_incident_schedule
bot = OpenQABot(args)
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/openqabot.py", line 23, in __init__
self.incidents = get_incidents(self.token)
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/qem.py", line 42, in get_incidents
xs.append(Incident(i))
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 70, in __init__
self.revisions = self._rev(self.channels, self.project)
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/types/incident.py", line 95, in _rev
max_rev = get_max_revision(lrepos, archver.arch, project)
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/repohash.py", line 46, in get_max_revision
raise e
File "/builds/qa-maintenance/bot-ng/qem-bot/openqabot/loader/repohash.py", line 35, in get_max_revision
root = ET.fromstring(requests.get(url).text)
File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 543, in get
return self.request('GET', url, **kwargs)
File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 507, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPConnectionPool(host='download.suse.de', port=80): Max retries exceeded with url: /ibs/SUSE:/Maintenance:/18458/SUSE_Updates_SLE-Product-SLES_15-SP1-BCL_x86_64/repodata/repomd.xml (Caused by ResponseError('too many 404 error responses',))
Please check with urgency whether this is a regression and fix it, or investigate the different, new failure. Until then I have disabled both schedule steps, "updates only schedule" and "incidents only schedule", on https://gitlab.suse.de/qa-maintenance/bot-ng/-/pipeline_schedules.
Updated by livdywan over 2 years ago
osukup wrote:
This was a very fast PR and review. You guys are awesome 😁️
Updated by osukup over 2 years ago
Follow-up: https://github.com/openSUSE/qem-bot/pull/59 after I looked at how backoff_factor is used in urllib3
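For reference, a rough illustration of how backoff_factor shapes the retry delays in the urllib3 1.x series shown in the traceback above (only an illustration of the formula, not the change in the PR):

# In urllib3 1.x the sleep before retry n is roughly backoff_factor * 2 ** (n - 1),
# capped at 120 seconds, and there is no sleep before the very first retry.
def backoff_schedule(retries, factor):
    return [0 if n == 1 else min(120, factor * 2 ** (n - 1)) for n in range(1, retries + 1)]

print(backoff_schedule(5, 2))  # [0, 4, 8, 16, 32]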
Updated by okurz over 2 years ago
- Priority changed from Immediate to High
"schedule incidents" passed in https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1079141, re-enabled schedule for both incidents and aggregates
Updated by mgrifalconi over 2 years ago
Hello, I see the ticket is in 'Feedback' state but there are still several update requests with only a green incident box that are not being auto-approved. http://dashboard.qam.suse.de/blocked
A few examples:
http://dashboard.qam.suse.de/incident/24723
http://dashboard.qam.suse.de/incident/24743
http://dashboard.qam.suse.de/incident/24762
Updated by jbaier_cz over 2 years ago
- Related to action #110409: qem-dashboard - remove old openQA jobs when rr_number changes size:M added
Updated by jbaier_cz over 2 years ago
- Related to action #109310: qem-bot/dashboard - mixed old and new incidents size:M added
Updated by mgrifalconi over 2 years ago
Aggregates disappeared again a few minutes ago. Before that they were showing up correctly.
They disappeared just after every single aggregate got green :(
I hope that this was not the cause, since it does not happen often :P
Updated by kraih over 2 years ago
I've made a small change to the dashboard so the journal is not flooded by HTTP request data anymore. Should make it easier to keep track of what data the dashboard cleans up (and rule out any possible regressions there). https://github.com/openSUSE/qem-dashboard/commit/1e6321f02b0c082f5659b24ef97898b24f248fcb
Updated by livdywan over 2 years ago
Discussed briefly in the Unblock. It would be great if others take a look and review the current code for flaws, i.e. pretend it's a new pull request, and we can then see if that helps us find some gaps or ideas where to improve logging. At the latest we can discuss it tomorrow in the mob session, or comment here earlier.
Updated by kraih over 2 years ago
kraih wrote:
I've made a small change to the dashboard so the journal is not flooded by HTTP request data anymore. Should make it easier to keep track of what data the dashboard cleans up (and rule out any possible regressions there). https://github.com/openSUSE/qem-dashboard/commit/1e6321f02b0c082f5659b24ef97898b24f248fcb
We've investigated this issue during the mob session today. A regression in the cleanup code has been ruled out; there are no log entries for the very recent incident 25413. We have tracked it down to a call of this code from qem-bot. The bot thinks the update_settings have been added to the database, but they do not appear to have ever been added. The next step will be to find out whether the problem is on the bot or the dashboard side.
Updated by livdywan over 2 years ago
- Due date changed from 2022-08-12 to 2022-08-26
Bumping the due date in accordance with ongoing research.
Updated by osukup over 2 years ago
We added logging of the post ID to qem-bot, and next we need another occurrence of the problem to analyze what is going on.
This needs cooperation with the openqa-qam reviewers (@mgrifalconi?)
Updated by okurz over 2 years ago
We do not have a good way to find "incidents with missing aggregate tests" programmatically, so we are relying on users to tell us about further suspicious cases. If anybody finds cases where you suspect missing aggregate tests, please tell us and we can take a look.
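One rough idea for such a programmatic check, sketched against a hypothetical JSON view of the dashboard data (the endpoints and field names below are invented for illustration and would have to be mapped to the real qem-dashboard API):

import requests

DASHBOARD = "http://dashboard.qam.suse.de"

def incidents_missing_aggregates(token):
    # Hypothetical endpoints/fields; the real dashboard API may differ.
    incidents = requests.get(DASHBOARD + "/api/incidents", headers=token).json()
    suspicious = []
    for inc in incidents:
        aggregates = requests.get(
            DASHBOARD + "/api/jobs/update/" + str(inc["number"]), headers=token
        ).json()
        if inc.get("inReviewQAM") and not aggregates:
            suspicious.append(inc["number"])
    return suspicious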
Updated by martinsmac over 2 years ago
Hello, some tests show "No data yet" and have been more than 1 day in the SMELT queue:
http://dashboard.qam.suse.de/blocked
http://dashboard.qam.suse.de/incident/24979
http://dashboard.qam.suse.de/incident/25004
http://dashboard.qam.suse.de/incident/25265
http://dashboard.qam.suse.de/incident/25213
http://dashboard.qam.suse.de/incident/25423
Could you please verify? Thank you
Updated by livdywan over 2 years ago
- Due date changed from 2022-08-26 to 2022-09-02
I guess this still needs to be validated
Updated by kraih over 2 years ago
I have a new suspicion where the problem could be, but to confirm that I need one concrete example of an incident with missing aggregate jobs from the past few days.
Updated by livdywan over 2 years ago
osukup wrote:
We added logging of post ID to qem-bot , and next we need another occasion of problem to analyze what is going on
Updated by livdywan over 2 years ago
- Due date changed from 2022-09-02 to 2022-09-23
- Assignee changed from osukup to kraih
We discussed that we'll try to spot actual examples, which might be easier if some of us join UV temporarily (proposed elsewhere), make sure we confirm the problem, and Sebastian will ideally look into the fix
Updated by kraih over 2 years ago
Since there's not been much progress recently, I'll look into a possible solution later this week.
Updated by kraih over 2 years ago
My suspicion turned out to be a real bug: aggregate jobs could be deleted by accident if update_settings are in use for longer than 90 days. The solution I had in mind looks promising in preliminary tests, so I'll try to add some unit tests too and then deploy it to production.
Updated by kraih over 2 years ago
Committed a possible fix. Now we'll have to keep an eye on it again. https://github.com/openSUSE/qem-dashboard/commit/9cf6e655007fad2a366c9a9b4bf6f0f353de69fd
Updated by kraih over 2 years ago
- Status changed from In Progress to Feedback
The change has been deployed via pipeline 10 hours ago and is now in production.
Updated by okurz about 2 years ago
- Due date changed from 2022-09-23 to 2022-10-14
Regarding the ACs I tried select * from job_settings where key ~ 'TEST_REPOS' and value ~ '25171' limit 10;
but have not found any results. I wonder, can we re-add results so that they show up where they should?
Updated by jbaier_cz about 2 years ago
- Related to action #117619: Bot approved update request with failing tests size:M added
Updated by kraih about 2 years ago
okurz wrote:
Regarding the ACs I tried
select * from job_settings where key ~ 'TEST_REPOS' and value ~ '25171' limit 10;
but have not found any results. I wonder, can we re-add results so that they show up where they should?
That data would have to come from the bot, not sure if it is possible. Probably not.
Updated by kraih about 2 years ago
I've simplified the ACs, since bringing back old data would be very hard, if it is possible at all.
Updated by kraih about 2 years ago
I went through the incident data currently in the dashboard and it mostly looks fine. There are a few suspicious ones in the blocked list; one was new enough to have been created after the fix was deployed. Incident 26186 (SMELT) had no data at all yet. From the bot logs in the pipelines it looked like jobs were being scheduled though:
INFO: not scheduling: Flavor: Leap-DVD-Incidents, version: 15.3 incident: 26186 , arch: x86_64 - exists in openQA
{'approved': False,
'channels': ['SUSE:Updates:SLE-Manager-Tools:15:x86_64',
'SUSE:Updates:SLE-Manager-Tools:15:s390x',
'SUSE:Updates:SLE-Manager-Tools:15:ppc64le',
'SUSE:Updates:SLE-Manager-Tools:15:aarch64',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:aarch64',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:ppc64le',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:s390x',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:x86_64',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:x86_64',
'SUSE:Updates:openSUSE-SLE:15.4',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:s390x',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:ppc64le',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.3:aarch64',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:x86_64',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:s390x',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:ppc64le',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.2:aarch64',
'SUSE:SLE-15-SP1:Update',
'SUSE:Updates:Storage:6:x86_64',
'SUSE:Updates:Storage:6:aarch64',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:x86_64',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:s390x',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:ppc64le',
'SUSE:Updates:SLE-Module-SUSE-Manager-Proxy:4.1:aarch64',
'SUSE:Updates:openSUSE-SLE:15.3',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:aarch64',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:ppc64le',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:s390x',
'SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:x86_64'],
'emu': False,
'inReview': True,
'inReviewQAM': True,
'isActive': True,
'number': 26186,
'packages': ['golang-github-prometheus-alertmanager'],
'project': 'SUSE:Maintenance:26186',
'rr_number': 281776}
INFO: Inc 26186 does not have any aggregates settings
...
INFO: Inc 26186 has failed job in incidents
But after speaking with Ondrej I found out that the missing results in the dashboard are intentional. Jobs are scheduled for the products openSUSE-SLE:15.3/15.4, but both are still in the development folder, and therefore all openQA test results are ignored by the bot and never reach the dashboard. This condition should probably be highlighted better, because the reviewers were also not aware of it. But so far it looks like we don't have any missing aggregate job results since the fix was deployed.
Updated by kraih about 2 years ago
- Status changed from In Progress to Resolved
After talking to more people I'm ready to call this resolved. As a follow-up I've suggested to Ondrej that we add some kind of indicator in the dashboard when scheduled jobs do exist in openQA but are ignored because the product is in a development group, because those will likely result in false-positive bug reports in the future. So if anyone is revisiting this in the future, make sure to check that first!
Updated by kraih about 2 years ago
- Status changed from Resolved to In Progress
Not resolved after all; today we are missing many aggregate jobs. https://gitlab.suse.de/qa-maintenance/bot-ng/-/jobs/1187105
ERROR: {"error":"Referenced update settings (49294) do not exist"}
ERROR: {"error":"Referenced update settings (49189) do not exist"}
ERROR: {"error":"Referenced update settings (49412) do not exist"}
...
Updated by kraih about 2 years ago
I will be disabling the cleanup feature completely now.
Updated by kraih about 2 years ago
And all removed update settings will be logged. https://github.com/openSUSE/qem-dashboard/commit/4a4ea13506ba4ae8be1266667da80ca284db53a8
Updated by kraih about 2 years ago
The next wave of aggregate jobs has arrived on the dashboard and it looks good again. Now we have to wait and see again. If something goes wrong the new logging will help us locate the problem.
Updated by livdywan about 2 years ago
- Due date changed from 2022-10-14 to 2022-10-21
- Status changed from In Progress to Feedback
I take it that means waiting another week
Updated by kraih about 2 years ago
So far the data in the dashboard is looking good. If it stays that way I'll permanently remove the cleanup feature and call the issue resolved.
Updated by kraih about 2 years ago
And we've had another case today: pretty much all aggregate jobs vanished from the dashboard at once. Reviewing the logs and database now.
Updated by kraih about 2 years ago
Quite interesting results in the logs. The data must have vanished between 11:00 and 15:00. With the cleanup feature disabled, only the incident reuse feature can now delete update settings, and there was only one case today:
Oct 17 12:14:18 qam2 dashboard[26091]: [26091] [i] Cleaning up old jobs for incident 26179, rr_number change: 282072 -> 282453
Oct 17 12:14:18 qam2 dashboard[26091]: [26091] [i] Update settings cleaned up for incident 26179: 49611, 49593, 49495, 49703, 49471, 49479, 49492, 49689, 49708, 49853, 49712, 49706, 49806, 49489, 49488, 49630,
49582, 49613, 49698, 49815, 49756, 49800, 49744, 49602, 49587, 49646, 49502, 49486, 49757, 49581, 49583,
49580, 49817, 49812, 49604, 49588, 49485, 49854, 49645, 49754, 49704, 49612, 49422, 49600, 49594, 49743, 49524, 49821, 49589, 49476, 49595, 49537, 49522, 49523,
49831, 49824, 49692, 49813, 49820, 49427, 49591, 49536, 49825, 49707, 49851, 49482, 49597, 49690, 49713, 49532, 49490, 49710, 49585, 49818, 49810, 49474, 49424,
49809, 49752, 49701, 49501, 49481, 49816, 49642, 49814, 49525, 49426, 49421, 49702, 49801, 49478, 49472, 49586, 49477, 49832, 49491, 49521, 49751, 49811,
49695, 49822, 49740, 49592, 49469, 49644, 49855, 49590, 49850, 49470, 49802, 49601, 49535, 49694, 49711, 49709, 49531, 49741, 49714, 49631, 49584, 49803, 49723,
49635, 49697, 49520, 49693, 49691, 49596, 49641, 49473, 49425, 49699, 49634, 49603, 49715, 49534, 49487, 49805, 49696, 49480, 49833, 49633, 49599, 49742, 49598,
49722, 49705, 49721, 49494, 49605, 49755, 49799, 49503, 49852, 49745, 49493, 49804, 49647, 49700, 49632, 49579, 49483, 49475, 49823, 49807, 49808, 49484,
49819
So the only remaining explanation is that this incident reuse caused a cascading delete of update settings and their jobs that was large enough to clear the entire blocked list. That is certainly unexpected, but it opens up a new option for a possible fix: instead of deleting all the associated data, we could just sever the link between incident number and update settings in the incident_in_update table.
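A rough sketch of that idea with assumed column names (the dashboard itself is written in Perl, so this Python snippet is purely illustrative and the actual schema and fix may differ): only the rows linking the incident to the old update settings are removed, while the settings and their jobs stay.

import psycopg2

def unlink_incident(dsn, incident_number, old_settings_ids):
    # Sever the incident <-> update_settings link; keep the settings and jobs.
    # Table name taken from the comment above, column names are guesses.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "DELETE FROM incident_in_update"
            " WHERE incident = %s AND settings = ANY(%s)",
            (incident_number, list(old_settings_ids)),
        )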
Updated by kraih about 2 years ago
I've deployed a new fix that only removes the link between incident number and update settings, but does not remove the settings/jobs anymore. https://github.com/openSUSE/qem-dashboard/commit/e6c1c8914458a3e92055d09be148bf65e70c8793
Updated by kraih about 2 years ago
Feedback from test reviewers has been good so far. I'm not going to call this resolved yet, since I don't want to jinx it again. :)
Updated by livdywan about 2 years ago
- Due date changed from 2022-10-21 to 2022-10-28
I guess we can wait til the end of the week
Updated by kraih about 2 years ago
cdywan wrote:
I guess we can wait til the end of the week
Or someone else than me sets it to resolved. ;)
Updated by kraih about 2 years ago
All the data vanishing today was not related to this ticket; it was just a SMELT outage.
Updated by okurz about 2 years ago
- Due date deleted (2022-10-28)
- Status changed from Feedback to Resolved
As there is no easy way mentioned for quickly checking for incidents where we might miss aggregate tests, I checked the results on http://dashboard.qam.suse.de/blocked manually. I found multiple cases where there are aggregate tests but no incident tests. Then I found http://dashboard.qam.suse.de/incident/26632 for kubevirt, but it was created just today, so there are incident tests but the request is simply not yet included in any aggregate test build, which would come tonight. Other examples are e.g. https://smelt.suse.de/incident/26576/ about "python-pylint", which has no tests linked at all. For those cases I have created https://gitlab.suse.de/tools/smelt/-/issues/924 but also reopened #99072 because there should be Leap-related openQA tests but there aren't, hence the approval is blocked. So in conclusion I did not find any further cases where there are incident tests but no aggregate tests, hence resolving.