action #93943
closed
openqa-review pipeline fails accessing OSD test overview pages sometimes, more retries?
Added by okurz over 3 years ago. Updated over 3 years ago.
Description
Observation
https://gitlab.suse.de/openqa/openqa-review/-/jobs/458168#L70 shows
WARNING:openqa_review.browser:Request to https://openqa.suse.de/tests/overview?distri=sle&version=15-SP3&build=20210610-2&groupid=366 was not successful after multiple retries, giving up
Traceback (most recent call last):
File "/usr/bin/openqa-review", line 11, in <module>
load_entry_point('openqa-review==0.0.0', 'console_scripts', 'openqa-review')()
File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1299, in main
report = generate_report(args)
File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1233, in generate_report
return Report(browser, args, root_url, job_groups)
File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1184, in __init__
self.report[k] = self._one_report(v)
File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1195, in _one_report
return ProductReport(self.browser, job_group_url, self.root_url, self.args)
File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 973, in __init__
previous_details = browser.get_soup(previous_url)
File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 72, in get_soup
return BeautifulSoup(self.get_page(url), 'html.parser')
File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 103, in get_page
content = self._get(absolute_url, as_json=as_json)
File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 147, in _get
raise DownloadError(msg)
openqa_review.browser.DownloadError: Request to https://openqa.suse.de/tests/overview?distri=sle&version=15-SP3&build=20210610-2&groupid=366 was not successful after multiple retries, giving up
Expected result
- AC1: Daily reports are generated in a stable manner
Suggestions
Can we try with more retries?
Updated by livdywan over 3 years ago
I don't know if these are related, but OSD looks responsive, and it's only https://openqa.suse.de/tests/overview requests that seem to time out. It doesn't look like the network is the problem:
ERROR:openqa_review.openqa_review:Could not find any soft failure reference within details of soft-failed job 'https://openqa.suse.de/tests/5590738'. Could be deleted workaround needle?.
ERROR:openqa_review.openqa_review:Could not find any soft failure reference within details of soft-failed job 'https://openqa.suse.de/tests/5590756'. Could be deleted workaround needle?.
[...]
ERROR:openqa_review.openqa_review:Failed to process {'state': 'NEW_SOFT_ISSUE', 'href': '/tests/6230452', 'failedmodules': []} with error Request to https://openqa.suse.de/tests/6230452/file/details-console_reboot#1.json was not successful, status code: 404. Skipping current result
Maybe related to #92854 🤔️
Updated by okurz over 3 years ago
"Could not find any soft failure" and the other error are not related; these are different problems.
Updated by livdywan over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Ack. For now I'm going to try and improve the error message. While discussing this in chat it came up that we should clarify whether it's a 404, a 500, or a timeout.
Updated by livdywan over 3 years ago
For reference the command to reproduce locally:
mkdir -p
python3 ./openqa_review/openqa_review.py --host https://openqa.suse.de -n -r -T --no-empty-sections --include-softfails --running-threshold=2 --exclude-job-groups '^(Released|Development|old)' --save --save-dir out
(Side note: I could not avoid needing valid Bugzilla credentials even after dropping --query-issue-status; maybe an unrelated bug.)
To reproduce the error I changed the host to test.openqa.suse.de (.invalid doesn't work here).
I want to say this is by definition not a 502/503/504, since the error would then say "was not successful, status code: ". But the code should preserve at least the last exception or request status; currently, due to the retries, that gets lost.
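A minimal sketch of what preserving the last failure could look like (hypothetical helper, not the actual browser.py code):
import requests

class DownloadError(Exception):
    """Stand-in for openqa_review.browser.DownloadError."""

def get_with_retries(url, retries=7, timeout=2.5):
    """Remember the last status code or exception so the final error is informative."""
    last_failure = None
    for _ in range(retries):
        try:
            r = requests.get(url, timeout=timeout)
            if r.ok:
                return r.content
            # keep the HTTP status of the failed attempt
            last_failure = "status code: %s" % r.status_code
        except requests.RequestException as e:  # timeouts, connection errors, ...
            last_failure = repr(e)
    raise DownloadError("Request to %s was not successful after %s retries: %s" % (url, retries, last_failure))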
Updated by okurz over 3 years ago
The format of your comment is broken.
(Side note: I could not avoid needing valid Bugzilla credentials even after dropping --query-issue-status; maybe an unrelated bug.)
That's expected because of the option --reminder-comment-on-issues, which I suggest you avoid in local testing against live instances, as it means that your user account will write comments on Bugzilla.
Updated by livdywan over 3 years ago
Thanks! Fixed the formatting (had some rogue line breaks in there) and dropped --reminder-comment-on-issues from the command.
Updated by openqa_review over 3 years ago
- Due date set to 2021-06-29
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 3 years ago
- Due date changed from 2021-06-29 to 2021-06-21
The PR got delayed due to some style questions. Hoping to merge and check the errors today.
Updated by okurz over 3 years ago
https://github.com/os-autoinst/openqa_review/pull/147 merged. The GitLab CI pipeline uses container images built with the released openqa-review packages from openSUSE Tumbleweed. So after I create a release and a submit request to openSUSE Factory, some days can pass until the GitLab CI pipeline has access to a more recent version.
Updated by okurz over 3 years ago
How about using /tests/overview.json instead of HTML document parsing and the suggested additional retries?
Updated by livdywan over 3 years ago
Ack, I'm going to propose additional PRs. I had thought it would go quicker; in hindsight I should've prepared those in parallel.
Updated by livdywan over 3 years ago
okurz wrote:
How about [...] the suggested additional retries?
Updated by livdywan over 3 years ago
cdywan wrote:
okurz wrote:
How about [...] the suggested additional retries?
After proposing the simple increase I also pondered a little more and decided to try replacing the loop-and-else approach:
https://github.com/os-autoinst/openqa_review/pull/149
This is similar to what's described here although I didn't create a new class for a single use: https://findwork.dev/blog/advanced-usage-python-requests-timeouts-retries-hooks/
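For reference, the Retry-based setup described in that blog post looks roughly like this (a sketch of the general technique, not the exact PR code):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry transient server errors with exponential backoff instead of a manual loop
retry = Retry(total=7, backoff_factor=1, status_forcelist=(500, 502, 503, 504))
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)
# exhausted retries now raise an exception instead of silently falling through
response = session.get("https://openqa.suse.de/tests/overview", timeout=2.5)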
Updated by livdywan over 3 years ago
okurz wrote:
How about using /tests/overview.json instead of HTML document parsing[...]
The "soup" parsing is not very obvious at first; code like find_all(class_="badge") comes with no documentation of the expected data, so I had to do some reverse engineering. There's also no test coverage, so I'm looking into that first. Iterating on changes with real requests is way too slow.
For reference the relevant "soup" would look something like this:
<div id="summary" class="card border-success">
<div class="card-header">
Overall Summary of
<strong><a href="/group_overview/4">openSUSE Tumbleweed PowerPC</a></strong>
build 20160917
<div class="time-params">
showing latest jobs,
<a href="/tests/overview?distri=opensuse&version=Tumbleweed&build=20160917&groupid=4&t=2021-06-22+10%3A54%3A30+%2B0000">overview fixed to the current time</a>
</div>
</div>
<div class="card-body">
Passed: <span class="badge badge-success">0</span>
Failed: <span class="badge badge-danger">0</span>
</div>
</div>
<div class="card border-danger" id="summary">
  <div class="card-header">
    Overall Summary of
    <strong><a href="/group_overview/377">Create_hdd_stable_hosts</a></strong>
    build 15_GM
    <div class="time-params">
      showing latest jobs,
      <a href="/tests/overview?distri=sle&version=15&build=15_GM&groupid=377&t=2021-06-21+16%3A23%3A57+%2B0000">overview fixed to the current time</a>
    </div>
  </div>
  <div class="card-body">
    Passed: <span class="badge badge-success">0</span>
    Soft-Failed: <span class="badge badge-warning">2</span>
    Failed: <span class="badge badge-danger">1</span>
  </div>
</div>
<i class="status fa fa-circle result_passed" title="Done: passed"></i>
</a>
<div class="card border-danger" id="summary">
  <div class="card-header">
    Overall Summary of
    <strong><a href="/group_overview/377">Create_hdd_stable_hosts</a></strong>
    build 15-SP3_GM
    <div class="time-params">
      showing latest jobs,
      <a href="/tests/overview?distri=sle&version=15-SP3&build=15-SP3_GM&groupid=377&t=2021-06-21+16%3A26%3A08+%2B0000">overview fixed to the current time</a>
    </div>
  </div>
  <div class="card-body">
    Passed: <span class="badge badge-success">3</span>
    Incomplete: <span class="badge badge-secondary">1</span>
    Failed: <span class="badge badge-danger">0</span>
  </div>
</div>
<i class="status fa fa-circle result_incomplete" title="Done: incomplete"></i>
<h3>Flavor: Installer-DVD-POST</h3>
<table class="overview fixedheader table table-striped table-hover" id="results_Installer-DVD-POST">
<thead>
<tr>
<th>Test</th>
<th id="flavor_Installer-DVD-POST_arch_ppc64le">ppc64le</th>
<th id="flavor_Installer-DVD-POST_arch_s390x">s390x</th>
<th id="flavor_Installer-DVD-POST_arch_x86_64">x86_64</th>
</tr>
</thead>
<!-- body omitted -->
</table>
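And a small BeautifulSoup sketch (a hypothetical test helper, not the actual openqa-review code) showing how the badge counts in such a summary card can be pulled out of a saved fixture:
from bs4 import BeautifulSoup

html = """
<div id="summary" class="card border-success">
  <div class="card-body">
    Passed: <span class="badge badge-success">3</span>
    Failed: <span class="badge badge-danger">1</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
summary = soup.find(id="summary")
# the text node before each badge is the state label, the badge text is the count
counts = {badge.previous_sibling.strip().rstrip(":"): int(badge.text) for badge in summary.find_all(class_="badge")}
assert counts == {"Passed": 3, "Failed": 1}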
Updated by okurz over 3 years ago
- Due date changed from 2021-06-21 to 2021-06-29
As discussed with cdywan, setting the due date back to the original cycle-time-based estimate, as cdywan learned that we need to wait a bit longer to incorporate proper package updates through OBS.
Updated by okurz over 3 years ago
- Status changed from In Progress to Feedback
PR merged, created a new release: https://build.opensuse.org/request/show/901431 . We should wait for the submission to be accepted into Factory, then for a new Tumbleweed snapshot, and then for a new container image that is used in the GitLab CI pipeline.
EDIT: https://build.opensuse.org/request/show/901431 accepted
Updated by jbaier_cz over 3 years ago
Not sure if that is related, but it seems that the overview page is still a problem sometimes, see https://gitlab.suse.de/openqa/openqa-review/-/jobs/474652
Updated by livdywan over 3 years ago
jbaier_cz wrote:
Not sure if that is related, but it seems that the overview page is still a problem sometimes, see https://gitlab.suse.de/openqa/openqa-review/-/jobs/474652
Yes. My PR wasn't a magical fix but was meant to make it more robust and reveal errors rather than ignoring them.
WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='openqa.suse.de', port=443): Read timed out. (read timeout=2.5)")': /api/v1/jobs/5990475/details
[...]
WARNING:urllib3.connectionpool:Retrying (Retry(total=6, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='openqa.suse.de', port=443): Read timed out. (read timeout=2.5)")': /tests/overview?distri=sle&version=15-SP3&build=186.1&groupid=143
So these are expected. I'm not sure what this one is, though. It looks like we're getting corrupt JSON. I'm thinking we should dump the corrupt data (which might be valid but unexpected data). I've not seen this locally:
File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 996, in _get_bugref_for_softfailed_module
details_json = self.test_browser.get_json(details_url)
File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 75, in get_json
return self.get_page(url, as_json=True, cache=cache)
File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 102, in get_page
content = self._get(absolute_url, as_json=as_json)
File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 147, in _get
content = r.json() if as_json else r.content.decode("utf8")
[...]
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Updated by livdywan over 3 years ago
Updated by jbaier_cz over 3 years ago
I will probably state the obvious: judging from json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0), the content is not JSON or (more probably) the content is just empty:
>>> import json
>>> json.loads('')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Updated by livdywan over 3 years ago
- Due date changed from 2021-06-29 to 2021-07-02
Proposed a small PR to raise exceptions for HTTP errors, https://github.com/os-autoinst/openqa_review/pull/163, after discussing this on Jitsi. The hypothesis is that we see an empty response because the request failed on the last retry without raising an exception.
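The core of the idea, assuming the browser wrapper uses requests (a sketch, not the actual PR diff):
import requests

def get_json(url, timeout=2.5):
    r = requests.get(url, timeout=timeout)
    # turn 4xx/5xx responses into requests.exceptions.HTTPError
    # so a failed request can no longer reach the JSON decoder as an empty body
    r.raise_for_status()
    return r.json()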
Updated by livdywan over 3 years ago
- Due date changed from 2021-07-02 to 2021-07-07
cdywan wrote:
Proposed a small PR to raise exceptions for HTTP errors, https://github.com/os-autoinst/openqa_review/pull/163, after discussing this on Jitsi. The hypothesis is that we see an empty response because the request failed on the last retry without raising an exception.
The PR got merged but still needs to be released and deployed, hence moving the due date.
Updated by livdywan over 3 years ago
Related MRs:
- https://gitlab.suse.de/openqa/openqa-review/-/merge_requests/6
- https://gitlab.suse.de/openqa/openqa-review/-/merge_requests/7 (and @okurz changed the time of the pipeline)
Updated by okurz over 3 years ago
We discussed it after the weekly (okurz, cdywan, jbaier, dheidler) and applied the following changes:
- Changed the cron schedule in https://gitlab.suse.de/openqa/openqa-review/-/pipeline_schedules from 0 5 * * 1-5 to 47 23 * * * to
  - avoid full-hour congestion
  - avoid a potentially busy time with work-time overlap from both APAC+EMEA
  - also run during the less congested weekend to crosscheck whether the behaviour differs depending on load
- Still generate the complete status page with the generated reports even if some fail: https://gitlab.suse.de/openqa/openqa-review/-/merge_requests/6
- Retry steps in gitlab on failures: https://gitlab.suse.de/openqa/openqa-review/-/merge_requests/7
- Changed https://build.opensuse.org/projects/home:okurz:container:ca/meta to pull the python-openqa_review package from https://build.opensuse.org/package/show/home:okurz/python-openqa_review which is updated on every git commit to master in https://github.com/os-autoinst/openqa_review
EDIT: Then we see requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://progress.opensuse.org/issues/94342.json
https://gitlab.suse.de/openqa/openqa-review/-/settings/ci_cd has the redmine API key. With something like curl -s -H 'X-Redmine-API-Key: XXX' 'https://progress.opensuse.org/issues/94342.json' we can access private tickets, so the API key seems to be fine. Maybe the implementation within openqa-review does not work there.
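For comparison with the curl call, the same access in Python would look roughly like this (a sketch; the environment variable name is an assumption):
import os
import requests

headers = {"X-Redmine-API-Key": os.environ["REDMINE_API_KEY"]}  # hypothetical variable name
r = requests.get("https://progress.opensuse.org/issues/94342.json", headers=headers, timeout=10)
r.raise_for_status()  # a 401 here would mean the key is not passed along correctly
print(r.json())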
Updated by jbaier_cz over 3 years ago
- Related to action #95033: openqa-review fails upon trying to access openqa with no-urlencoded addresses added
Updated by dheidler over 3 years ago
Added a PR for the progress API issue: https://github.com/os-autoinst/openqa_review/pull/164
Updated by okurz over 3 years ago
@cdywan it turns out that problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149 as found out by
git bisect start HEAD 1.19.0
git bisect run python3 openqa_review/openqa_review.py --host https://openqa.suse.de -n -r -T --query-issue-status --no-empty-sections --include-softfails --running-threshold=2 --exclude-job-groups '^(Released|Development|old|EOL)' -J https://openqa.suse.de/group_overview/143
so we can revert multiple commits but likely the requests.retry approach is still good. But now we have a better reference with 1.19.0. With https://openqa.io.suse.de/openqa-review/ currently showing basically a single report file, this is becoming even more urgent.
Updated by livdywan over 3 years ago
- Due date changed from 2021-07-07 to 2021-06-29
okurz wrote:
problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149
[...]
so we can revert multiple commits but likely the requests.retry approach is still good
I don't follow. There's exactly one commit and that uses Retry instead of a loop.
Updated by livdywan over 3 years ago
- Due date changed from 2021-06-29 to 2021-07-07
Updated by okurz over 3 years ago
cdywan wrote:
okurz wrote:
problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149
[...]
so we can revert multiple commits but likely the requests.retry approach is still good
I don't follow. There's exactly one commit and that uses Retry instead of a loop.
Yes, I was also surprised but I believe what "git bisect" is telling us :) Please make sure that the team acts on this with urgency; it doesn't have to be you.
Updated by livdywan over 3 years ago
okurz wrote:
cdywan wrote:
okurz wrote:
problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149
[...]
so we can revert multiple commits but likely the requests.retry approach is still good
I don't follow. There's exactly one commit and that uses Retry instead of a loop.
Yes, I was also surprised but I believe what "git bisect" is telling us :) Please make sure that the team acts on this with urgency; it doesn't have to be you.
Ack. I'll propose a new PR handling network errors, so that in a report they result in a message without stopping other reports from being generated. This should bring us back to a state similar to the loop that swallowed errors. If not, let's decide tomorrow whether we'd rather go back to an older version.
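A rough sketch of that per-report error handling (hypothetical names; the real loop lives in openqa_review.py):
class DownloadError(Exception):
    """Stand-in for openqa_review.browser.DownloadError."""

def generate_all_reports(job_group_urls, generate_one):
    """Turn download failures for single job groups into messages instead of aborting."""
    reports = {}
    for url in job_group_urls:
        try:
            reports[url] = generate_one(url)
        except DownloadError as e:
            # the failing group gets an error note, all other reports still render
            reports[url] = "Failed to generate report for %s: %s" % (url, e)
    return reports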
Updated by livdywan over 3 years ago
Good news, I managed to reproduce a HTTPError manually and proposed a PR to handle it: https://github.com/os-autoinst/openqa_review/pull/166
Updated by okurz over 3 years ago
I think this should be "In Progress" as I understand you are actively working on this.
As you seem to have problems reproducing this, please monitor what actually happens in the pipelines: https://gitlab.suse.de/openqa/openqa-review/-/pipelines
For example:
- https://gitlab.suse.de/openqa/openqa-review/-/jobs/483641 shows "WARNING: Uploading artifacts as "archive" to coordinator... failed id=483641 responseStatus=502 Bad Gateway status=502", no one here seems to have mentioned this problem
- https://gitlab.suse.de/openqa/openqa-review/-/jobs/483642#L147 and four other jobs in the same pipeline show
openqa_review.browser.DownloadError: Request to https://openqa.suse.de/tests/overview?distri=sle&version=15-SP3&build=187.1&groupid=110 was not successful after 7 retries: HTTPSConnectionPool(host='openqa.suse.de', port=443): Max retries exceeded with url: /tests/overview?distri=sle&version=15-SP3&build=187.1&groupid=110 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='openqa.suse.de', port=443): Read timed out. (read timeout=2.5)"))
or similar. Your PR does not handle that, as it just turns another error into a DownloadError as well.
Updated by livdywan over 3 years ago
- Status changed from Feedback to In Progress
Ack, I forgot to change it back. I'm having a bit more success with reproducing the specific issues but it seems to vary during the day. I even got an ssl.SSLCertVerificationError now after simply restarting my development container.
I'm keeping in mind going back to an older version as a last resort if I can't get a handle on this today.
Updated by livdywan over 3 years ago
I'm also testing with an increased timeout for individual requests, because I noticed that we end up retrying without ever succeeding if the server is too slow to respond within the timeout: https://github.com/os-autoinst/openqa_review/pull/167
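The timeout in question is the per-request timeout handed to requests, i.e. something along these lines (the values are assumptions, not the PR's exact numbers):
import requests

session = requests.Session()
# (connect timeout, read timeout): keep connecting fast, but give the server more
# time to render the heavy /tests/overview page before the retry machinery kicks in
page = session.get("https://openqa.suse.de/tests/overview?distri=sle&version=15-SP3&build=186.1&groupid=143", timeout=(3.05, 30))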
Updated by livdywan over 3 years ago
- https://gitlab.suse.de/openqa/openqa-review/-/jobs/483641 shows "WARNING: Uploading artifacts as "archive" to coordinator... failed id=483641 responseStatus=502 Bad Gateway status=502", no one here seems to have mentioned this problem
I've seen this but I'm not sure what it means. The thing is, I can download the artifacts.zip file and it contains e.g. openqa_opensuse_org_status.html. So it doesn't seem like it fails in the end?
Updated by livdywan over 3 years ago
Also, for further reference, I artificially tested that the report continues if all individual results fail. It will still fail, though, if e.g. the overview route cannot respond within 10 seconds after 7 retries, because then we have no data at all. But individual failures don't matter.
Also, https://gitlab.suse.de/openqa/openqa-review/-/jobs/484379 looks really good again.
Updated by livdywan over 3 years ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
Moving to Feedback again and reducing the priority accordingly. I'll leave the due date, though, and intend to check again tomorrow whether it's looking fine.
Updated by okurz over 3 years ago
Yes, very good progress. Thank you! If the next "nightly" job is again problematic we can consider yet another point in time when the reports should be generated. Maybe now we are conflicting with some bigger schedule triggering from qa-maintenance/openQABot
Updated by livdywan over 3 years ago
okurz wrote:
Yes, very good progress. Thank you! If the next "nightly" job is again problematic we can consider yet another point in time when the reports should be generated. Maybe now we are conflicting with some bigger schedule triggering from qa-maintenance/openQABot
I retried several pipelines which ran at different times during the day. The puzzling "upload error" hasn't come back so far. Do we consider this good enough?
Updated by okurz over 3 years ago
cdywan wrote:
I retried several pipelines which ran at different times during the day. The puzzling "upload error" hasn't come back so far. Do we consider this good enough?
Yes, because we should react to pipeline failures during our normal alert handling anyway.