action #93943

openqa-review pipeline fails accessing OSD test overview pages sometimes, more retries?

Added by okurz about 1 month ago. Updated 21 days ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
2021-06-14
Due date:
2021-07-07
% Done:

0%

Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/openqa-review/-/jobs/458168#L70 shows

WARNING:openqa_review.browser:Request to https://openqa.suse.de/tests/overview?distri=sle&version=15-SP3&build=20210610-2&groupid=366 was not successful after multiple retries, giving up
Traceback (most recent call last):
  File "/usr/bin/openqa-review", line 11, in <module>
    load_entry_point('openqa-review==0.0.0', 'console_scripts', 'openqa-review')()
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1299, in main
    report = generate_report(args)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1233, in generate_report
    return Report(browser, args, root_url, job_groups)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1184, in __init__
    self.report[k] = self._one_report(v)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1195, in _one_report
    return ProductReport(self.browser, job_group_url, self.root_url, self.args)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 973, in __init__
    previous_details = browser.get_soup(previous_url)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 72, in get_soup
    return BeautifulSoup(self.get_page(url), 'html.parser')
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 103, in get_page
    content = self._get(absolute_url, as_json=as_json)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 147, in _get
    raise DownloadError(msg)
openqa_review.browser.DownloadError: Request to https://openqa.suse.de/tests/overview?distri=sle&version=15-SP3&build=20210610-2&groupid=366 was not successful after multiple retries, giving up

Expected result

  • AC1: Daily reports are generated in a stable manner

Suggestions

Can we try with more retries?


Related issues

Related to QA - action #95033: openqa-review fails upon trying to access openqa with no-urlencoded addresses (Resolved, 2021-07-02 to 2021-07-16)

History

#1 Updated by cdywan about 1 month ago

I don't know if these are related. But it looks like osd is responsive. And it's only https://openqa.suse.de/tests/overview requests that seem to time out. It doesn't look like the network is the problem:

ERROR:openqa_review.openqa_review:Could not find any soft failure reference within details of soft-failed job 'https://openqa.suse.de/tests/5590738'. Could be deleted workaround needle?.
ERROR:openqa_review.openqa_review:Could not find any soft failure reference within details of soft-failed job 'https://openqa.suse.de/tests/5590756'. Could be deleted workaround needle?.
[...]
ERROR:openqa_review.openqa_review:Failed to process {'state': 'NEW_SOFT_ISSUE', 'href': '/tests/6230452', 'failedmodules': []} with error Request to https://openqa.suse.de/tests/6230452/file/details-console_reboot#1.json was not successful, status code: 404. Skipping current result

Maybe related to #92854 🤔️

#2 Updated by okurz about 1 month ago

"Could not find any soft failure" and the other error are not related, these are different problems.

#3 Updated by cdywan about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to cdywan

Ack. For now I'm going to try and improve the error message. While discussing this in chat it came up that we should clarify whether it's a 404, a 500, or a timeout.

#4 Updated by cdywan about 1 month ago

For reference the command to reproduce locally:

mkdir -p out
python3 ./openqa_review/openqa_review.py --host https://openqa.suse.de -n -r -T --no-empty-sections --include-softfails --running-threshold=2 --exclude-job-groups '^(Released|Development|old)' --save --save-dir out

(Side note, I could not avoid having valid bugzilla credentials even after dropping --query-issue-status, maybe an unrelated bug)

To reproduce the error I changed the host to test.openqa.suse.de (.invalid doesn't work here).

I want to say this is by definition not a 502/503/504, since in that case the error would say "was not successful, status code: ". But the code should preserve at least the last exception or request status; currently that gets lost due to the retries.

https://github.com/os-autoinst/openqa_review/pull/147
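
A minimal sketch of what "preserving the last exception" across retries could look like (names and structure are hypothetical, not the actual PR #147 code):

```python
import time

import requests


def get_with_retries(url, retries=7, backoff=1.0):
    """Fetch a URL, retrying on request errors.

    Unlike a bare for/else loop that discards failures, this keeps the
    last exception so the caller sees *why* all attempts failed.
    """
    last_error = None
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=2.5)
            response.raise_for_status()
            return response
        except requests.RequestException as error:
            last_error = error
            time.sleep(backoff * (attempt + 1))
    # Re-raise with the last underlying cause attached via exception chaining
    raise RuntimeError(f"Request to {url} failed after {retries} retries") from last_error
```

With `raise ... from last_error`, the traceback then shows the final `ReadTimeoutError`/`ConnectionError` instead of only a generic "giving up" message.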

#5 Updated by okurz about 1 month ago

The format of your comment is broken.

(Side note, I could not avoid having valid bugzilla credentials even after dropping --query-issue-status, maybe an unrelated bug)

That's expected because of the option --reminder-comment-on-issues which I suggest you avoid in local testing with live instances as that means that your user account will write comments on bugzilla.

#6 Updated by cdywan about 1 month ago

Thanks! Fixed the formatting (had some rogue line breaks in there) and dropped --reminder-comment-on-issues from the command.

#7 Updated by openqa_review about 1 month ago

  • Due date set to 2021-06-29

Setting due date based on mean cycle time of SUSE QE Tools

#8 Updated by cdywan about 1 month ago

  • Due date changed from 2021-06-29 to 2021-06-21

The PR got delayed due to some style questions. Hoping to merge and check the errors today.

#9 Updated by okurz about 1 month ago

https://github.com/os-autoinst/openqa_review/pull/147 merged. The gitlab CI pipeline uses container images based on the released openqa-review packages from openSUSE Tumbleweed. So after I create a release and submit it to openSUSE Factory, some days can pass until the gitlab CI pipeline has access to a more recent version.

#10 Updated by okurz about 1 month ago

How about using /tests/overview.json instead of HTML document parsing and the suggested additional retries?

#11 Updated by cdywan about 1 month ago

Ack, I'm going to propose additional PRs. I had thought it was gonna go quicker. In hindsight I should've prepared those in parallel.

#12 Updated by cdywan about 1 month ago

okurz wrote:

How about [...] the suggested additional retries?

https://github.com/os-autoinst/openqa_review/pull/148

#13 Updated by cdywan about 1 month ago

cdywan wrote:

okurz wrote:

How about [...] the suggested additional retries?

https://github.com/os-autoinst/openqa_review/pull/148

After proposing the simple increase I also pondered a little more and decided to try replacing the loop-and-else approach:

https://github.com/os-autoinst/openqa_review/pull/149

This is similar to what's described here although I didn't create a new class for a single use: https://findwork.dev/blog/advanced-usage-python-requests-timeouts-retries-hooks/
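
The Retry-based approach from that article can be sketched like this (parameter values are illustrative, not necessarily what the PR uses):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Mount a Retry-configured adapter on a session instead of hand-rolling
# a retry loop; urllib3 then retries transparently on every request.
session = requests.Session()
retries = Retry(
    total=7,                                      # overall retry budget
    backoff_factor=2,                             # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],   # also retry these HTTP status codes
)
adapter = HTTPAdapter(max_retries=retries)
session.mount("http://", adapter)
session.mount("https://", adapter)
# session.get(url, timeout=2.5) now retries transparently
```

Compared to a manual loop, this also retries on connection-level failures and keeps the retry policy in one place.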

#14 Updated by cdywan about 1 month ago

okurz wrote:

How about using /tests/overview.json instead of HTML document parsing[...]

The "soup" parsing is not very obvious at first (code like find_all(class_="badge") with no documentation of the expected data), so I had to do some reverse engineering. There's also no test coverage, so I'm looking into that first. Iterating changes with real requests is way too slow.

For reference the relevant "soup" would look something like this:

    <div id="summary" class="card border-success">
        <div class="card-header">
            Overall Summary of
                <strong><a href="/group_overview/4">openSUSE Tumbleweed PowerPC</a></strong>
                build 20160917
            <div class="time-params">
                    showing latest jobs,
                    <a href="/tests/overview?distri=opensuse&amp;version=Tumbleweed&amp;build=20160917&amp;groupid=4&amp;t=2021-06-22+10%3A54%3A30+%2B0000">overview fixed to the current time</a>
            </div>
        </div>
        <div class="card-body">
            Passed: <span class="badge badge-success">0</span>
            Failed: <span class="badge badge-danger">0</span>
        </div>
    </div>

    <div class="card border-danger" id="summary">
        <div class="card-header">
            Overall Summary of <strong><a href="/group_overview/377">Create_hdd_stable_hosts</a></strong> build 15_GM
            <div class="time-params">
                showing latest jobs,
                <a href="/tests/overview?distri=sle&amp;version=15&amp;build=15_GM&amp;groupid=377&amp;t=2021-06-21+16%3A23%3A57+%2B0000">overview fixed to the current time</a>
            </div>
        </div>
        <div class="card-body">
            Passed: <span class="badge badge-success">0</span>
            Soft-Failed: <span class="badge badge-warning">2</span>
            Failed: <span class="badge badge-danger">1</span>
        </div>
    </div>
    <i class="status fa fa-circle result_passed" title="Done: passed"></i>

    <div class="card border-danger" id="summary">
        <div class="card-header">
            Overall Summary of <strong><a href="/group_overview/377">Create_hdd_stable_hosts</a></strong> build 15-SP3_GM
            <div class="time-params">
                showing latest jobs,
                <a href="/tests/overview?distri=sle&amp;version=15-SP3&amp;build=15-SP3_GM&amp;groupid=377&amp;t=2021-06-21+16%3A26%3A08+%2B0000">overview fixed to the current time</a>
            </div>
        </div>
        <div class="card-body">
            Passed: <span class="badge badge-success">3</span>
            Incomplete: <span class="badge badge-secondary">1</span>
            Failed: <span class="badge badge-danger">0</span>
        </div>
    </div>
    <i class="status fa fa-circle result_incomplete" title="Done: incomplete"></i>

<h3>Flavor: Installer-DVD-POST</h3>                                                                 
<table class="overview fixedheader table table-striped table-hover" id="results_Installer-DVD-POST">
<thead>                                                                                             
<tr>                                                                                                
<th>Test</th>                                                                                       
<th id="flavor_Installer-DVD-POST_arch_ppc64le">ppc64le</th>                                        
<th id="flavor_Installer-DVD-POST_arch_s390x">s390x</th>                                            
<th id="flavor_Installer-DVD-POST_arch_x86_64">x86_64</th>                                          
</tr>                         
</thead>                    
<!-- body omitted -->
</table>
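
For illustration, the badge counts could be pulled out of a summary card like the ones above with BeautifulSoup (a minimal sketch with inlined sample HTML, not the actual openqa_review code):

```python
from bs4 import BeautifulSoup

# Trimmed-down sample of the "Overall Summary" card shown above
html = """
<div id="summary" class="card border-success">
  <div class="card-body">
    Passed: <span class="badge badge-success">3</span>
    Failed: <span class="badge badge-danger">1</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
summary = soup.find(id="summary")
# Each badge's preceding text node is its label ("Passed:", "Failed:", ...)
counts = {
    badge.previous_sibling.strip().rstrip(":"): int(badge.text)
    for badge in summary.find_all(class_="badge")
}
# counts == {"Passed": 3, "Failed": 1}
```

This is also roughly the shape a unit test fixture for the soup parsing could take, which avoids iterating against a live instance.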

#15 Updated by okurz about 1 month ago

  • Due date changed from 2021-06-21 to 2021-06-29

As discussed with cdywan, setting the due date back to the original cycle-time-based estimate, as cdywan learned that we need to wait a bit longer to incorporate proper package updates through OBS.

#16 Updated by okurz about 1 month ago

  • Status changed from In Progress to Feedback

PR merged, created new release https://build.opensuse.org/request/show/901431 . We should wait for the submission to be accepted into Factory, a new Tumbleweed snapshot after that, and then a new container image that is used in the gitlab CI pipeline.

EDIT: https://build.opensuse.org/request/show/901431 accepted

#17 Updated by jbaier_cz about 1 month ago

Not sure, if that is related, but it seems that the overview page is still a problem sometimes, see https://gitlab.suse.de/openqa/openqa-review/-/jobs/474652

#18 Updated by cdywan about 1 month ago

jbaier_cz wrote:

Not sure, if that is related, but it seems that the overview page is still a problem sometimes, see https://gitlab.suse.de/openqa/openqa-review/-/jobs/474652

Yes. My PR wasn't a magical fix but meant to make it more robust and reveal errors rather than ignoring them.

WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='openqa.suse.de', port=443): Read timed out. (read timeout=2.5)")': /api/v1/jobs/5990475/details
[...]
WARNING:urllib3.connectionpool:Retrying (Retry(total=6, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='openqa.suse.de', port=443): Read timed out. (read timeout=2.5)")': /tests/overview?distri=sle&version=15-SP3&build=186.1&groupid=143

So these are expected. I'm not sure what this one is, though. It looks like we're getting corrupt JSON. I'm thinking we should dump the corrupt data (which might be valid unexpected data). I've not seen this locally:

  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 996, in _get_bugref_for_softfailed_module
    details_json = self.test_browser.get_json(details_url)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 75, in get_json
    return self.get_page(url, as_json=True, cache=cache)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 102, in get_page
    content = self._get(absolute_url, as_json=as_json)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 147, in _get
    content = r.json() if as_json else r.content.decode("utf8")
[...]
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

#20 Updated by jbaier_cz about 1 month ago

I will probably state the obvious: judging from json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0), the content is not JSON or (more probably) the content is just empty:

>>> import json
>>> json.loads('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

#21 Updated by cdywan 28 days ago

  • Due date changed from 2021-06-29 to 2021-07-02

Proposed a small PR to raise exceptions for HTTP errors https://github.com/os-autoinst/openqa_review/pull/163 after discussing this on Jitsi. Hypothesis being that we see an empty response because the request failed after the last retry w/o raising an exception.
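
The hypothesis can be illustrated with a simulated response (a sketch only, not the PR code):

```python
import requests

# A 5xx response does not raise on its own, so an empty body can reach
# json() and fail with JSONDecodeError there. Calling raise_for_status()
# first surfaces the HTTP error instead.
response = requests.models.Response()
response.status_code = 503   # simulate a failed final retry
response._content = b""      # empty body, like the observed case

try:
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx
    response.json()               # would otherwise fail with JSONDecodeError
except requests.HTTPError as error:
    caught = str(error)
```

With this in place the log would show the actual HTTP status instead of a confusing "Expecting value: line 1 column 1" decode error.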

#22 Updated by cdywan 27 days ago

  • Due date changed from 2021-07-02 to 2021-07-07

cdywan wrote:

Proposed a small PR to raise exceptions for HTTP errors https://github.com/os-autoinst/openqa_review/pull/163 after discussing this on Jitsi. Hypothesis being that we see an empty response because the request failed after the last retry w/o raising an exception.

The PR got merged but still needs to be released and deployed, hence moving the due date.

#24 Updated by okurz 27 days ago

We discussed it after the weekly (okurz, cdywan, jbaier, dheidler) and applied the following changes:

EDIT: Then we see requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://progress.opensuse.org/issues/94342.json

https://gitlab.suse.de/openqa/openqa-review/-/settings/ci_cd has the redmine API key. With something like curl -s -H 'X-Redmine-API-Key: XXX' 'https://progress.opensuse.org/issues/94342.json' we can access private tickets so the API key seems to be fine. Maybe the implementation within openqa-review does not work there.

#25 Updated by jbaier_cz 27 days ago

  • Related to action #95033: openqa-review fails upon trying to access openqa with no-urlencoded addresses added

#26 Updated by dheidler 27 days ago

Added a PR for the progress API issue: https://github.com/os-autoinst/openqa_review/pull/164

#27 Updated by okurz 24 days ago

cdywan, it turns out that problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149, as found out by

git bisect start HEAD 1.19.0 
git bisect run python3 openqa_review/openqa_review.py --host https://openqa.suse.de -n -r -T --query-issue-status --no-empty-sections --include-softfails --running-threshold=2 --exclude-job-groups '^(Released|Development|old|EOL)' -J https://openqa.suse.de/group_overview/143

so we can revert multiple commits, but the requests.retry approach itself is likely still good. In any case we now have a better reference with 1.19.0. In the current state, with https://openqa.io.suse.de/openqa-review/ now showing basically a single report file, this is becoming even more urgent.

#28 Updated by cdywan 24 days ago

  • Due date changed from 2021-07-07 to 2021-06-29

okurz wrote:

problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149
[...]
so we can revert multiple commits but likely the requests.retry approach is still good

I don't follow. There's exactly one commit and that uses Retry instead of a loop.

#29 Updated by cdywan 24 days ago

  • Due date changed from 2021-06-29 to 2021-07-07

#30 Updated by okurz 23 days ago

cdywan wrote:

okurz wrote:

problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149
[...]
so we can revert multiple commits but likely the requests.retry approach is still good

I don't follow. There's exactly one commit and that uses Retry instead of a loop.

Yes, I was also surprised but I believe what "git bisect" is telling us :) Please make sure that the team acts on this with urgency; it doesn't have to be you.

#31 Updated by cdywan 23 days ago

okurz wrote:

cdywan wrote:

okurz wrote:

problems have actually been worsened with 60ebc73 from https://github.com/os-autoinst/openqa_review/pull/149
[...]
so we can revert multiple commits but likely the requests.retry approach is still good

I don't follow. There's exactly one commit and that uses Retry instead of a loop.

Yes, I was also surprised but I believe what "git bisect" is telling us :) Please make sure that the team acts that with urgency, doesn't have to be you.

Ack. I'll propose a new PR handling network errors, so that in a report they result in a message without stopping other reports from being generated. This should bring us back to a state similar to the loop which swallowed errors. If not, let's decide tomorrow if we'd rather go back to an older version.
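
A sketch of that per-report error handling (names are hypothetical, not the actual openqa_review code):

```python
def generate_reports(job_groups, fetch_report):
    """Build a report per job group; a failing group becomes an error
    message in the output instead of aborting all other reports."""
    reports = {}
    for group, url in job_groups.items():
        try:
            reports[group] = fetch_report(url)
        except Exception as error:  # e.g. DownloadError, requests.HTTPError
            reports[group] = f"Error while fetching {url}: {error}"
    return reports
```

The key point is that the exception is caught per job group, so one flaky overview page no longer takes down the whole pipeline run.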

#32 Updated by cdywan 23 days ago

Good news, I managed to reproduce a HTTPError manually and proposed a PR to handle it: https://github.com/os-autoinst/openqa_review/pull/166

#33 Updated by okurz 23 days ago

I think this should be "In Progress" as I understand you actively work on this.

As you seem to have problems with reproducing this, please monitor what actually happens in the pipelines: https://gitlab.suse.de/openqa/openqa-review/-/pipelines

For example:

  • https://gitlab.suse.de/openqa/openqa-review/-/jobs/483641 shows "WARNING: Uploading artifacts as "archive" to coordinator... failed id=483641 responseStatus=502 Bad Gateway status=502", no one here seems to have mentioned this problem
  • https://gitlab.suse.de/openqa/openqa-review/-/jobs/483642#L147 and four other jobs in the same pipeline show openqa_review.browser.DownloadError: Request to https://openqa.suse.de/tests/overview?distri=sle&version=15-SP3&build=187.1&groupid=110 was not successful after 7 retries: HTTPSConnectionPool(host='openqa.suse.de', port=443): Max retries exceeded with url: /tests/overview?distri=sle&version=15-SP3&build=187.1&groupid=110 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='openqa.suse.de', port=443): Read timed out. (read timeout=2.5)")) or similar. Your PR does not handle that as it turns another error also into a DownloadError

#34 Updated by cdywan 23 days ago

  • Status changed from Feedback to In Progress

Ack, I forgot to change it back. I'm having a bit more success with reproducing the specific issues but it seems to vary during the day. I even got ssl.SSLCertVerificationError now after simply restarting my development container.

I'm keeping in mind going back to an older version as a last resort if I can't get a handle on this today.

#35 Updated by cdywan 23 days ago

I'm also testing with an increased timeout for individual requests, because I noticed that we end up retrying without ever succeeding if the server is too slow to respond within the timeout: https://github.com/os-autoinst/openqa_review/pull/167
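
A sketch of what raising the per-request timeout could look like (values are illustrative, not necessarily what PR #167 uses):

```python
import requests

# The read timeout of 2.5s seen in the logs means any response slower
# than that counts as a failure, and retrying cannot help if the server
# consistently needs longer.
CONNECT_TIMEOUT = 3.5   # seconds to establish the connection
READ_TIMEOUT = 10       # seconds to wait for the server to send data


def get_page(session, url):
    # requests accepts a (connect, read) tuple for the timeout
    return session.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
```

Splitting connect and read timeouts lets the connect stay aggressive while giving a busy server more time to render a large overview page.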

#36 Updated by cdywan 22 days ago

I've seen this but I'm not sure what it means. The thing is, I can download the artifacts.zip file, and it contains e.g. openqa_opensuse_org_status.html. So it doesn't seem like it fails in the end?

#37 Updated by cdywan 22 days ago

Also for further reference, I artificially tested that the report continues if all individual results fail. It will fail, though, if e.g. the overview route still fails after 7 retries with the 10 second timeout, because then we have no data at all. But individual ones don't matter.

Also, https://gitlab.suse.de/openqa/openqa-review/-/jobs/484379 looks really good again.

#38 Updated by cdywan 22 days ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

Moving to Feedback again and reducing Priority accordingly. I'll leave the due date, though, and intend to check again tomorrow whether it looks fine.

#39 Updated by okurz 22 days ago

Yes, very good progress. Thank you! If the next "nightly" job is again problematic we can consider yet another point in time when the reports should be generated. Maybe now we are conflicting with some bigger schedule triggering from qa-maintenance/openQABot

#40 Updated by cdywan 21 days ago

okurz wrote:

Yes, very good progress. Thank you! If the next "nightly" job is again problematic we can consider yet another point in time when the reports should be generated. Maybe now we are conflicting with some bigger schedule triggering from qa-maintenance/openQABot

I retried several pipelines which ran at different times during the day. The puzzling "upload error" didn't come back so far. Do we consider this good enough?

#41 Updated by okurz 21 days ago

cdywan wrote:

I retried several pipelines which ran at different times during the day. The puzzling "upload error" didn't come back so far. Do we consider this good enough?

Yes, because we should react to pipeline failures during our normal alert handling anyway.

#42 Updated by cdywan 21 days ago

  • Status changed from Feedback to Resolved
