action #97856
closed: [sporadic] openqa-review pipeline failed: ConnectionResetError size:M
Description
Observation
The pipeline failed yesterday: https://gitlab.suse.de/openqa/openqa-review/-/jobs/552027
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib64/python3.8/http/client.py", line 459, in read
    n = self.readinto(b)
  File "/usr/lib64/python3.8/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib64/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib64/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib64/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/requests/models.py", line 753, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/openqa-review", line 33, in <module>
    sys.exit(load_entry_point('openqa-review==0.0.0', 'console_scripts', 'openqa-review')())
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1557, in main
    report = generate_report(args)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1488, in generate_report
    return Report(browser, args, root_url, job_groups)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1437, in __init__
    self.report[k] = self._one_report(v)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1448, in _one_report
    return ProductReport(self.browser, job_group_url, self.root_url, self.args)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1162, in __init__
    self.reports[arch] = ArchReport(arch, results, args, root_url, progress_browser, bugzilla_browser, browser)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 898, in __init__
    self._search_for_bugrefs_for_softfailures(results)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 942, in _search_for_bugrefs_for_softfailures
    module_url = self._get_url_to_softfailed_module(v["href"])
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 984, in _get_url_to_softfailed_module
    details = self.test_browser.get_json("/api/v1/jobs/" + job_url.split("/")[-1] + "/details")
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 77, in get_json
    return self.get_page(url, as_json=True, cache=cache)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 104, in get_page
    content = self._get(absolute_url, as_json=as_json)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 119, in _get
    r = http.get(url, auth=self.auth, timeout=30, headers=self.headers)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 697, in send
    r.content
  File "/usr/lib/python3.8/site-packages/requests/models.py", line 831, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/usr/lib/python3.8/site-packages/requests/models.py", line 756, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
I guess only a retry would help here, but the error message could possibly be made nicer as well.
Acceptance criteria
- AC1: At least 10 jobs in the openqa-review pipeline pass without this problem after a potential fix is applied
Suggestions
- Figure out what "104" in relation to the message "Connection reset by peer" could mean
- Try to apply a "retry" or an actual "reconnect" at the right step; since we know that this is a sporadic issue, a workaround like this should work
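As a rough illustration of both suggestions: 104 is the Linux errno value for ECONNRESET, which Python surfaces as ConnectionResetError, and a retry can be wrapped around the call that fetches the page. The sketch below is not the openqa-review code; `retry_on` and `flaky_fetch` are hypothetical names, and for a `requests`-based caller one would presumably pass `requests.exceptions.ChunkedEncodingError` (the exception at the bottom of the traceback) as the exception type.

```python
import errno
import time


def retry_on(exceptions, attempts=3, delay=0.0, sleep=time.sleep):
    """Return a decorator that re-runs a function when it raises `exceptions`."""

    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
                    sleep(delay * attempt)  # linear backoff between attempts

        return wrapper

    return decorator


# Hypothetical stand-in for a flaky HTTP fetch: resets twice, then succeeds.
calls = {"count": 0}


@retry_on(ConnectionResetError, attempts=3)
def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        # An OSError built with errno ECONNRESET becomes a ConnectionResetError.
        raise OSError(errno.ECONNRESET, "Connection reset by peer")
    return {"ok": True}


result = flaky_fetch()
```

Re-raising on the last attempt keeps the original traceback visible, so a persistent failure still fails the pipeline instead of being silently swallowed.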
Workaround
Retrigger
Updated by livdywan over 3 years ago
tinita wrote:
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
I guess only a retry would help here, but the error message could possibly be made nicer as well.
We could consider separate timeouts for connect and read, e.g. timeout=(30, 500), although we're already being very generous.
Alternatively we can add 104 to status_forcelist=[...].
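A minimal sketch of what such a configuration could look like with `requests` and urllib3's `Retry`, assuming a shared session; the constants are the values floated above, not settings taken from openqa-review. Two caveats: in `requests` the timeout tuple is ordered (connect, read), and `status_forcelist` matches HTTP status codes, so the errno 104 would not be matched there. As far as I can tell, a reset while the response body is being read (the case in the traceback) may also surface as ChunkedEncodingError outside of urllib3's retry logic, so an application-level retry may still be needed.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Assumed values, for illustration only.
CONNECT_TIMEOUT = 30
READ_TIMEOUT = 500

retry = Retry(
    total=3,
    connect=3,  # retries for errors while establishing the connection
    read=3,  # retries for errors while waiting for the response
    backoff_factor=1,
    status_forcelist=[502, 503, 504],  # HTTP status codes, not errno values
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)


def get(url):
    """Fetch `url` with separate connect and read timeouts."""
    return session.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
```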
Updated by okurz over 3 years ago
- Subject changed from openqa-review pipeline failed: ConnectionResetError to [sporadic] openqa-review pipeline failed: ConnectionResetError size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita over 3 years ago
We had several instances of timeouts and ConnectionResetError.
Yesterday we had a timeout failure.
https://gitlab.suse.de/openqa/openqa-review/-/jobs/562469
Since it repeatedly failed for job group 309, I investigated and found out that the /group_overview/\d route is fetching far too many comments.
Fixed that in https://github.com/os-autoinst/openQA/pull/4170 which was merged. So I'm positive that we won't see timeouts after the next deployment.
Today we had two failures.
One with the ConnectionResetError:
https://gitlab.suse.de/openqa/openqa-review/-/jobs/566257
and one timeout (job group 309) again.
Updated by kraih over 3 years ago
After some discussions on Slack (and seeing graphs about the number of RST TCP packets we send), I have the strong suspicion that some actions in the WebAPI run too slowly for their configured inactivity_timeout. This causes the connections to be closed before a response can be generated. The code generating the response would still run to completion, but it could never actually send the response, since the connection is already closed. Possible solutions would be to make the code run faster and/or to increase the inactivity_timeout.
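To make the suspected mechanism concrete, here is a toy model in Python. It is not Mojolicious or openQA code; `Connection`, `serve` and the timings are invented for illustration. The handler keeps running after the connection is closed, but its response can no longer be delivered.

```python
import threading
import time


class Connection:
    """Toy stand-in for a client connection; not a real socket."""

    def __init__(self):
        self.open = True
        self.received = None

    def send(self, data):
        if not self.open:
            return False  # the connection is gone, the response is dropped
        self.received = data
        return True


def serve(handler, inactivity_timeout):
    """Run `handler` and report whether its response reached the client."""
    conn = Connection()
    outcome = {}

    def run():
        response = handler()  # keeps running even if the connection closes
        outcome["delivered"] = conn.send(response)

    worker = threading.Thread(target=run)
    worker.start()
    worker.join(inactivity_timeout)  # wait at most this long for activity
    if worker.is_alive():
        conn.open = False  # timeout hit: the server closes the connection
    worker.join()  # the handler still runs to completion
    return outcome["delivered"]


def slow_action():
    time.sleep(0.3)
    return "report"


def fast_action():
    return "report"


slow_delivered = serve(slow_action, inactivity_timeout=0.05)
fast_delivered = serve(fast_action, inactivity_timeout=0.5)
```

In this model either speeding up `slow_action` or raising `inactivity_timeout` gets the response delivered, which matches the two possible solutions named above.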
Updated by livdywan over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
tinita wrote:
We had several instances of timeouts and ConnectionResetError.
Updated by openqa_review over 3 years ago
- Due date set to 2021-09-22
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita over 3 years ago
We had timeouts for two jobs again, and since retriggering will likely not succeed, we will wait until https://github.com/os-autoinst/openQA/pull/4170 is deployed on osd.
Updated by tinita over 3 years ago
After PR 4170, /group_overview/309.json went from around 55s to 5.5s.
Updated by okurz over 3 years ago
git grep 'inactivity.*' in openQA reveals that we already set a higher-than-default timeout, but only for worker communication, not the webUI. I suggest using the same approach as
lib/OpenQA/Worker.pm: $client->ua->inactivity_timeout($ENV{OPENQA_WORKER_CACHE_SERVICE_CHECK_INACTIVITY_TIMEOUT} // 10);
but for the webUI, i.e. identify the routes which can take longer, use a higher timeout for them and make it configurable, something like:
$self->inactivity_timeout($ENV{OPENQA_WEBUI_INACTIVITY_TIMEOUT} // 90);
or, if we have multiple routes with different requirements, consider a per-route factor.
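The "factor" idea could look roughly like this, sketched in Python rather than Perl purely for illustration. The variable OPENQA_WEBUI_INACTIVITY_TIMEOUT and the 90s default come from the suggestion above; `ROUTE_FACTORS`, its route key and its value are hypothetical.

```python
import os

DEFAULT_TIMEOUT = 90  # seconds, the default proposed above

# Hypothetical per-route factors; route names and values are illustrative only.
ROUTE_FACTORS = {"/group_overview": 2.0}


def inactivity_timeout(route, env=os.environ):
    """Return the timeout for `route`: configurable base times route factor."""
    raw = env.get("OPENQA_WEBUI_INACTIVITY_TIMEOUT")
    # Like Perl's `//`, fall back to the default only when the variable is unset.
    base = DEFAULT_TIMEOUT if raw is None else float(raw)
    return base * ROUTE_FACTORS.get(route, 1.0)
```

With nothing set, `inactivity_timeout("/group_overview", env={})` gives 180.0, while a route without a factor simply gets the configured base value.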
Updated by livdywan over 3 years ago
okurz wrote:
but for the webUI, i.e. identify the routes which can take longer, use a higher timeout for them and make it configurable, something like:
$self->inactivity_timeout($ENV{OPENQA_WEBUI_INACTIVITY_TIMEOUT} // 90);
I have now proposed a configurable timeout with a 90s default, like this, for the (parent) overview route: https://github.com/os-autoinst/openQA/pull/4190
It's not "the web UI". But this is what we discussed in the call so I went with that.
Updated by livdywan over 3 years ago
- Status changed from In Progress to Feedback
And I'll make it Feedback since I feel like it's more about agreement than implementation.
Updated by tinita over 3 years ago
The failure on Saturday was: ERROR: Job failed: pod "runner-rgb6rjcn-project-4884-concurrent-0xn8x5" status is "Failed"
So this was unrelated. #96827
Updated by livdywan over 3 years ago
- Copied to action #98631: Ensure all environment variables supported by openQA are documented added
Updated by okurz about 3 years ago
- Status changed from Feedback to Resolved
We have already applied actual changes to openQA, and we now have at least 10 passed pipelines on https://gitlab.suse.de/openqa/openqa-review/-/pipelines, so to me this looks resolved. Reopen if I overlooked something.