action #97856

closed

[sporadic] openqa-review pipeline failed: ConnectionResetError size:M

Added by tinita about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-09-01
Due date: 2021-09-22
% Done: 0%
Estimated time:
Description

Observation

The pipeline failed yesterday: https://gitlab.suse.de/openqa/openqa-review/-/jobs/552027

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib64/python3.8/http/client.py", line 459, in read
    n = self.readinto(b)
  File "/usr/lib64/python3.8/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib64/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib64/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib64/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/requests/models.py", line 753, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 576, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 541, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib64/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/bin/openqa-review", line 33, in <module>
    sys.exit(load_entry_point('openqa-review==0.0.0', 'console_scripts', 'openqa-review')())
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1557, in main
    report = generate_report(args)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1488, in generate_report
    return Report(browser, args, root_url, job_groups)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1437, in __init__
    self.report[k] = self._one_report(v)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1448, in _one_report
    return ProductReport(self.browser, job_group_url, self.root_url, self.args)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 1162, in __init__
    self.reports[arch] = ArchReport(arch, results, args, root_url, progress_browser, bugzilla_browser, browser)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 898, in __init__
    self._search_for_bugrefs_for_softfailures(results)
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 942, in _search_for_bugrefs_for_softfailures
    module_url = self._get_url_to_softfailed_module(v["href"])
  File "/usr/lib/python3.8/site-packages/openqa_review/openqa_review.py", line 984, in _get_url_to_softfailed_module
    details = self.test_browser.get_json("/api/v1/jobs/" + job_url.split("/")[-1] + "/details")
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 77, in get_json
    return self.get_page(url, as_json=True, cache=cache)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 104, in get_page
    content = self._get(absolute_url, as_json=as_json)
  File "/usr/lib/python3.8/site-packages/openqa_review/browser.py", line 119, in _get
    r = http.get(url, auth=self.auth, timeout=30, headers=self.headers)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 697, in send
    r.content
  File "/usr/lib/python3.8/site-packages/requests/models.py", line 831, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/usr/lib/python3.8/site-packages/requests/models.py", line 756, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

I guess only a retry would help here, but the error message could possibly also be made nicer.

Acceptance criteria

  • AC1: At least 10 jobs in the openqa-review pipeline pass without this problem after a potential fix is applied

Suggestions

  • Figure out what "104" means in relation to the message "Connection reset by peer"
  • Try to apply a "retry" or an actual "reconnect" at the right step; since we know this is a sporadic issue, a workaround like this should work
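
Both suggestions can be sketched in a few lines of Python. This is only an illustration, not the code that was merged into openqa-review: the helper names describe_errno and retry_on_reset are made up for this example, and the defaults (3 attempts, linear back-off) are assumptions. With requests, one would probably pass retry_on=(requests.exceptions.ChunkedEncodingError, requests.exceptions.ConnectionError, requests.exceptions.Timeout), since the final exception in the traceback above is a ChunkedEncodingError and not a built-in ConnectionError.

```python
import errno
import os
import time


def describe_errno(code):
    """Return the symbolic name and OS message for an errno value.

    On Linux, 104 is ECONNRESET ("Connection reset by peer"); the number
    differs on other platforms, so prefer errno.ECONNRESET in code.
    """
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


def retry_on_reset(fetch, attempts=3, delay=1.0,
                   retry_on=(ConnectionError, TimeoutError),
                   sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up.

    fetch is any zero-argument callable (hypothetical here). Only the
    exception types listed in retry_on are retried; the last failure
    is re-raised unchanged so the original traceback is kept.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt}/{attempts} failed: {exc!r}, retrying")
            sleep(delay * attempt)  # simple linear back-off


if __name__ == "__main__":
    print(describe_errno(errno.ECONNRESET))
```

Catching only specific exception types keeps real errors (for example HTTP 404 or a JSON decoding problem) from being retried needlessly.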

Workaround

Retrigger


Related issues 1 (1 open, 0 closed)

Copied to openQA Project - action #98631: Ensure all environment variables supported by openQA are documented (Status: New, Start date: 2021-09-01, Due date: 2021-09-22)

Actions #1

Updated by livdywan about 3 years ago

tinita wrote:

requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))


I guess only a retry would help here, but the error message could possibly also be made nicer.

We could consider separate timeouts for connect and read, e.g. timeout=(30, 500), although we're already being very generous.

Alternatively we can add 104 to status_forcelist=[...].
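
For reference, a minimal sketch of how both ideas look with requests and urllib3. The function name make_session, the retry counts, and the listed HTTP status codes are assumptions for illustration, not values from openqa-review. Note that status_forcelist is matched against HTTP status codes, so the errno value 104 would not match anything there; connection-level failures are governed by the connect/read counters instead. A reset that happens while the response body is being read may still surface as ChunkedEncodingError after the adapter has finished, so this alone might not cover the failure above.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# (connect timeout, read timeout) in seconds, as in the comment above.
TIMEOUT = (30, 500)


def make_session(total=3, backoff=1.0):
    """Build a requests session that retries failed requests.

    All numbers here are illustrative assumptions.
    """
    retry = Retry(
        total=total,
        connect=total,           # retries for connection errors
        read=total,              # retries for read errors
        backoff_factor=backoff,  # sleep between attempts grows with each retry
        status_forcelist=[502, 503, 504],  # HTTP status codes, not errno values
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

A call would then look like make_session().get(url, timeout=TIMEOUT).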

Actions #2

Updated by okurz about 3 years ago

  • Subject changed from openqa-review pipeline failed: ConnectionResetError to [sporadic] openqa-review pipeline failed: ConnectionResetError size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by tinita about 3 years ago

We had several instances of timeouts and ConnectionResetError.

Yesterday we had a timeout failure.
https://gitlab.suse.de/openqa/openqa-review/-/jobs/562469

Since it repeatedly failed for job group 309, I investigated and found out that /group_overview/\d is fetching way too many comments.
I fixed that in https://github.com/os-autoinst/openQA/pull/4170, which has been merged, so I'm positive that we won't see timeouts after the next deployment.

Today we had two failures.
One with the ConnectionResetError:
https://gitlab.suse.de/openqa/openqa-review/-/jobs/566257

and one timeout (job group 309) again.

Actions #4

Updated by kraih about 3 years ago

After some discussions on Slack (and seeing graphs of the number of TCP RST packets we send), I have a strong suspicion that some actions in the WebAPI run too slowly for their configured inactivity_timeout. This causes the connections to be closed before a response can be generated. The code that generates the response still runs to completion, but it can never actually send the response because the connection is already closed. Possible solutions are to make the code run faster and/or to increase the inactivity_timeout.
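
The suspected sequence can be illustrated with a toy model in Python. This is not openQA or Mojolicious code; the names slow_action and serve and the log messages are invented for the example, and the timings are scaled down to fractions of a second.

```python
import asyncio


async def slow_action(seconds, log):
    """Stand-in for a controller action that needs `seconds` to respond."""
    await asyncio.sleep(seconds)
    log.append("response generated")
    return "response"


async def serve(action_seconds, inactivity_timeout):
    """Model a server that drops the connection after inactivity_timeout."""
    log = []
    task = asyncio.create_task(slow_action(action_seconds, log))
    # asyncio.wait() does not cancel the task when the timeout expires.
    done, _pending = await asyncio.wait({task}, timeout=inactivity_timeout)
    if task in done:
        log.append("response sent")
    else:
        log.append("connection closed before response")
        await task  # the action still runs to completion
        log.append("response discarded")
    return log


if __name__ == "__main__":
    print(asyncio.run(serve(0.2, 0.05)))  # action slower than the timeout
    print(asyncio.run(serve(0.01, 0.5)))  # action faster than the timeout
```

In the first call the client would see a reset connection even though the server finished its work, which matches the two remedies named above: make the action faster or raise the timeout.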

Actions #5

Updated by livdywan about 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

tinita wrote:

We had several instances of timeouts and ConnectionResetError.

https://github.com/os-autoinst/openqa_review/pull/176

Actions #6

Updated by openqa_review about 3 years ago

  • Due date set to 2021-09-22

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by tinita about 3 years ago

We had timeouts for two jobs again, and since retriggering will likely not succeed we will wait until https://github.com/os-autoinst/openQA/pull/4170 is deployed on osd.

Actions #8

Updated by tinita about 3 years ago

After PR 4170 /group_overview/309.json went from around 55s to 5.5s.

Actions #9

Updated by okurz about 3 years ago

git grep 'inactivity.*' in openQA reveals that we already set a higher-than-default timeout, but only for worker communication, not the webUI. I suggest using the same approach as

lib/OpenQA/Worker.pm:    $client->ua->inactivity_timeout($ENV{OPENQA_WORKER_CACHE_SERVICE_CHECK_INACTIVITY_TIMEOUT} // 10);

but for the webUI, i.e. identify the routes which can take longer, use a higher timeout for them, and make it configurable, something like:

$self->inactivity_timeout($ENV{OPENQA_WEBUI_INACTIVITY_TIMEOUT} // 90);

or, if we have multiple routes with different requirements, consider a factor.
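
The "factor" idea could look roughly like the following. It is written in Python purely as a language-neutral illustration (openQA itself is Perl); the route names and factors in ROUTE_FACTORS are hypothetical, and only the variable name OPENQA_WEBUI_INACTIVITY_TIMEOUT and the default of 90 come from the suggestion above.

```python
import os

DEFAULT_TIMEOUT = 90  # seconds, the default suggested above

# Hypothetical per-route multipliers; routes not listed use factor 1.
ROUTE_FACTORS = {
    "group_overview": 1.0,
    "some_slow_route": 2.0,
}


def inactivity_timeout(route, env=os.environ):
    """Return the timeout for a route: configurable base times route factor."""
    base = float(env.get("OPENQA_WEBUI_INACTIVITY_TIMEOUT", DEFAULT_TIMEOUT))
    return base * ROUTE_FACTORS.get(route, 1.0)
```

A single environment variable then scales all routes at once, while the relative differences between routes stay in code.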

Actions #10

Updated by livdywan about 3 years ago

okurz wrote:

but for the webUI, i.e. identify the routes which can take longer, use a higher timeout for them, and make it configurable, something like:

$self->inactivity_timeout($ENV{OPENQA_WEBUI_INACTIVITY_TIMEOUT} // 90);

I have now proposed a configurable timeout with a 90s default, like this, for the (parent) overview route: https://github.com/os-autoinst/openQA/pull/4190

It's not "the web UI". But this is what we discussed in the call so I went with that.

Actions #11

Updated by livdywan about 3 years ago

  • Status changed from In Progress to Feedback

And I'll make it Feedback since I feel like it's more about agreement than implementation.

Actions #12

Updated by tinita about 3 years ago

The failure on Saturday was: ERROR: Job failed: pod "runner-rgb6rjcn-project-4884-concurrent-0xn8x5" status is "Failed"
So this was unrelated. #96827

Actions #13

Updated by livdywan about 3 years ago

  • Copied to action #98631: Ensure all environment variables supported by openQA are documented added
Actions #14

Updated by okurz about 3 years ago

  • Status changed from Feedback to Resolved

So we already applied actual changes to openQA, and now we have at least 10 passed pipelines on https://gitlab.suse.de/openqa/openqa-review/-/pipelines, so to me this looks resolved now. Reopen if I overlooked something.
