action #105603

openQABot pipeline failed: "ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)" size:M

Added by nicksinger 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2021-12-16
Due date:
% Done: 0%
Estimated time:

Description

The openQABot pipeline seems to fail roughly every 3 hours with a message similar to:

ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)

Unfortunately, the artifact/log doesn't show any further information about what is actually wrong, but the message looks a bit like it tries to parse a broken JSON response. Therefore I think this is different from what is described in poo#104085.

Acceptance criteria

  • AC1: Erroneous SMELT data due to network issues is not causing critical failures

Suggestions


Related issues

Related to QA - action #107671: No aggregate maintenance runs scheduled today on osd size:M (Resolved)

Copied from openQA Infrastructure - action #104085: openQABot pipeline failed with terminating connection due to administrator command size:S (Resolved, 2021-12-16)

History

#1 Updated by nicksinger 4 months ago

  • Copied from action #104085: openQABot pipeline failed with terminating connection due to administrator command size:S added

#2 Updated by osukup 4 months ago

something different but same cause - problems on SMELT or IBS side

#3 Updated by okurz 4 months ago

nicksinger wrote:

The openQABot pipeline seems to fail roughly every 3 hours with a message similar to:

ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)

Unfortunately, the artifact/log doesn't show any further information about what is actually wrong, but the message looks a bit like it tries to parse a broken JSON response. Therefore I think this is different from what is described in poo#104085.

This looks like something that should be workaround-able from openQABot side though.

#4 Updated by jbaier_cz 4 months ago

okurz wrote:

This looks like something that should be workaround-able from openQABot side though.

That depends on the definition of a workaround in this case. We probably want to increase the verbosity a little to see the data. Apart from that, no data == no jobs; it is probably not an error on our side, but I am not sure we want to mark such pipelines as succeeded.

#5 Updated by okurz 4 months ago

Yes, I agree that the problem is most likely caused by an external dependency which we can't fix but only cope with. But an error message like Expecting value: line 4 column 1 (char 3) means that the source code of openQABot does not handle the situation as well as it could, considering the dependency on something external.

Example of bad pseudocode:

data = get_data_from_remote()
read data[42]   # bad: assumes that data[42] always exists

Better: check the validity of the data before working with it, e.g. with assert, or use proper exception handling to cope with unexpected or missing data.
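A minimal Python sketch of the "check before use" idea from the pseudocode above. The helper name `get_item` and the assumption that the remote data is a list are purely illustrative and are not taken from openQABot:

```python
from typing import Any, Optional


def get_item(data: Any, index: int) -> Optional[Any]:
    """Return data[index] if the payload has the expected shape, else None."""
    # Hypothetical helper: validate the type and the bounds before indexing
    if not isinstance(data, list):
        return None
    if not 0 <= index < len(data):
        return None
    return data[index]


# Missing or malformed data yields None instead of an unhandled IndexError/TypeError
print(get_item([], 42))
print(get_item(None, 42))
```

The caller can then decide whether a missing item means "no work to do" or a real error.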

#6 Updated by jbaier_cz 4 months ago

I understand what you mean, but I disagree with you this time. This case is not similar to your example. As you can see, the error itself comes from https://gitlab.suse.de/qa-maintenance/openQABot/-/blob/master/openqabot/openqabot.py#L82, where the exception is handled (so at least we do not get a stack trace). The exception message Expecting value: line 4 column 1 (char 3) is not from openqabot (there is no such string in the source code); it most likely comes from JSONDecoder at https://gitlab.suse.de/qa-maintenance/openQABot/-/blob/master/openqabot/update/mr.py#L90, as the result from SMELT was probably not valid JSON. The JSONDecodeError exception was properly handled, hence the error message. I agree it could be improved, for example by being more verbose: adding a note about the JSON parsing error and printing the problematic response itself.
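To illustrate where such a message comes from, here is a small sketch using only Python's standard `json` module. The function name `parse_json_body` and the sample body are assumptions for illustration, not openQABot code. A non-JSON body that starts with three newlines followed by HTML makes the parser report exactly "Expecting value: line 4 column 1 (char 3)":

```python
import json
from typing import Any, Optional, Tuple


def parse_json_body(body: str) -> Tuple[Optional[Any], Optional[str]]:
    """Return (data, None) on success or (None, diagnostic message) on failure."""
    try:
        return json.loads(body), None
    except json.JSONDecodeError as err:
        # Keep the parser message and a short excerpt of the offending body
        return None, f"invalid JSON ({err}); body starts with {body[:40]!r}"


# Assumed for illustration: an HTML error page preceded by three newlines
html_body = "\n\n\n<!DOCTYPE html><html><body>Internal Server Error</body></html>"
data, problem = parse_json_body(html_body)
print(problem)
```

Logging the excerpt next to the parser message is one way to get the extra verbosity discussed here.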

#7 Updated by nicksinger 4 months ago

jbaier_cz wrote:

I agree it could be improved, for example by being extra verbose and adding a note about error in JSON parsing and printing the problematic response itself.

I agree that at least logging the problematic JSON (or appending it to the artifacts) would be helpful. That way we could maybe even help the SMELT developers fix a bug.

#8 Updated by jbaier_cz 4 months ago

nicksinger wrote:

I agree that at least logging the problematic JSON (or appending it to the artifacts) would be helpful. That way we could maybe even help the SMELT developers fix a bug.

So let's do something about it: https://gitlab.suse.de/qa-maintenance/openQABot/-/merge_requests/90

#9 Updated by cdywan 3 months ago

  • Subject changed from openQABot pipeline failed: "ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)" to openQABot pipeline failed: "ERROR:root:Something bad happended during reading MR data from SMELT/IBS: Expecting value: line 4 column 1 (char 3)" size:M
  • Description updated (diff)
  • Status changed from New to Workable

#10 Updated by jbaier_cz 3 months ago

  • Status changed from Workable to Resolved
  • Assignee set to jbaier_cz

Unfortunately the artifact/log doesn't show further information

This issue has been addressed by the MR. As the problem with SMELT/IBS did not reappear (even during the maintenance window on Thursday), I will resolve this ticket. We can reopen it if necessary (with more info).

#11 Updated by cdywan 3 months ago

  • Status changed from Resolved to Feedback

#12 Updated by cdywan 3 months ago

jbaier_cz wrote:

Unfortunately the artifact/log doesn't show further information

This issue has been addressed by the MR. As the problem with SMELT/IBS did not reappear (even during the maintenance window on Thursday), I will resolve this ticket. We can reopen it if necessary (with more info).

I'm afraid your wish was granted and a job failed because an HTTP 500 error page with an HTML body was returned.

#13 Updated by jbaier_cz 3 months ago

cdywan wrote:

jbaier_cz wrote:

Unfortunately the artifact/log doesn't show further information

This issue has been addressed by the MR. As the problem with SMELT/IBS did not reappear (even during the maintenance window on Thursday), I will resolve this ticket. We can reopen it if necessary (with more info).

I'm afraid your wish was granted and a job failed because an HTTP 500 error page with an HTML body was returned.

Yeah, so now we know for sure that it is on the SMELT side. What do you suggest we do?

#14 Updated by okurz 3 months ago

jbaier_cz wrote:

Yeah, so now we know for sure that it is on the SMELT side. What do you suggest we do?

Well, how would you handle it as a user? I guess retry with a sensible sleep time in between and abort retrying after a certain number of tries, right?
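The "retry, sleep, give up after N tries" approach described above could look roughly like this in Python. The function name `retry`, its defaults, and the injectable `sleep` parameter are illustrative assumptions, not the bot's actual implementation:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fetch: Callable[[], T], attempts: int = 5, delay: float = 30.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() up to `attempts` times, sleeping `delay` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # give up: let the caller treat this as a real error
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting; a real caller would probably also catch a narrower exception type than `Exception`.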

#15 Updated by jbaier_cz 3 months ago

okurz wrote:

jbaier_cz wrote:

Yeah, so now we know for sure that it is on the SMELT side. What do you suggest we do?

Well, how would you handle it as a user? I guess retry with a sensible sleep time in between and abort retrying after a certain number of tries, right?

Hm, what sleep time are we talking about? A few seconds or a few minutes? We already have that (it was already implemented) and it did not help. What about an hour? Oh, well, that is already the next run of this pipeline. So the "user approach" is already implemented. Now we are in the "abort retrying" phase.

#16 Updated by okurz 3 months ago

jbaier_cz wrote:

okurz wrote:

jbaier_cz wrote:

Yeah, so now we know for sure that it is on the SMELT side. What do you suggest we do?

Well, how would you handle it as a user? I guess retry with a sensible sleep time in between and abort retrying after a certain number of tries, right?

Hm, what sleep time are we talking about? A few seconds or a few minutes? We already have that (it was already implemented) and it did not help. What about an hour? Oh, well, that is already the next run of this pipeline. So the "user approach" is already implemented. Now we are in the "abort retrying" phase.

Right. Eventually we should abort and treat it as an error, otherwise we won't learn about problematic situations that are not fixed by retrying and need our help. IMHO it would make sense to still retry longer, so yes, why not an hour.

#17 Updated by jbaier_cz 3 months ago

okurz wrote:

Right. Eventually we should abort and treat it as an error, otherwise we won't learn about problematic situations that are not fixed by retrying and need our help.

Agreed. The last error is such a case.

IMHO it would make sense to still retry longer, so yes, why not an hour.

Well, in that case, this is also already implemented: the pipeline runs every hour.

#18 Updated by okurz 3 months ago

Yes, I got that. But retrying within the individual jobs for only some seconds or minutes would mean we abort with a failure, which we try to treat seriously as an alert. So IMHO it's better to handle it within the scripts, i.e. openQABot itself, so as not to raise an alert when retrying at a later time would still work.

#19 Updated by jbaier_cz 3 months ago

okurz wrote:

So IMHO it's better to handle it within the scripts, i.e. openQABot itself, so as not to raise an alert when retrying at a later time would still work.

I see your point. However, I am not sure that is actually doable (at least not without some hardcore hacking) inside a GitLab pipeline. We also do not want to retry too much, as we could then end up with multiple jobs running in parallel due to the long retry period.

The logs clearly show the problem: we received a 5xx error message from Apache instead of a valid reply. Without the data, the pipeline can't continue. As this problem is not on our side, I would suggest closing the ticket unless someone has an idea how to solve it without investing much time in a workaround. IMHO the optimal solution would be to send the error e-mail only after two consecutive pipeline failures, but I do not think that can be configured in our GitLab instance.
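The "alert only after two consecutive failures" idea can be sketched independently of GitLab. The class below is hypothetical; in practice the counter would have to be persisted between pipeline runs (for example in a cache or an artifact), which this sketch leaves out:

```python
class ConsecutiveFailureAlert:
    """Signal an alert only once `threshold` results in a row have failed."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, success: bool) -> bool:
        """Record one pipeline result; return True if an alert should be sent."""
        if success:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        return self.failures >= self.threshold


alert = ConsecutiveFailureAlert(threshold=2)
results = [alert.record(ok) for ok in (False, True, False, False)]
print(results)
```

A single failed run followed by a success never triggers an alert, while a second failure in a row does.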

#20 Updated by okurz 3 months ago

jbaier_cz wrote:

okurz wrote:

So IMHO it's better to handle it within the scripts, i.e. openQABot itself, so as not to raise an alert when retrying at a later time would still work.

I see your point. However, I am not sure that is actually doable (at least not without some hardcore hacking) inside a GitLab pipeline. We also do not want to retry too much, as we could then end up with multiple jobs running in parallel due to the long retry period.

That can easily be prevented within gitlab CI by not scheduling any more subsequent pipelines as long as the previous one is not yet finished.

The logs clearly show the problem: we received a 5xx error message from Apache instead of a valid reply. Without the data, the pipeline can't continue. As this problem is not on our side, I would suggest closing the ticket unless someone has an idea how to solve it without investing much time in a workaround. IMHO the optimal solution would be to send the error e-mail only after two consecutive pipeline failures, but I do not think that can be configured in our GitLab instance.

Yeah, I guess that is not something that gitlab CI provides. This is the reason why I suggested that we must handle that within our scripts. We should not close this ticket until we have ensured that the pipeline only fails and subsequently alerts us if something actionable happens, i.e. if we can and should actually do something about it, unless we choose not to be informed about pipeline failures but instead point to the smelt maintainer mailing list :)

#21 Updated by jbaier_cz 3 months ago

okurz wrote:

That can easily be prevented within gitlab CI by not scheduling any more subsequent pipelines as long as the previous one is not yet finished.

Do you know if that is possible and how to do it? I am not aware of such a feature.

Yeah, I guess that is not something that gitlab CI provides. This is the reason why I suggested that we must handle that within our scripts. We should not close this ticket until we have ensured that the pipeline only fails and subsequently alerts us if something actionable happens, i.e. if we can and should actually do something about it, unless we choose not to be informed about pipeline failures but instead point to the smelt maintainer mailing list :)

I see notifications for the failure and also for the subsequent success of the pipeline, so I know there is an error if I do not get a pipeline-succeeded e-mail after the pipeline-failed e-mail. If you only want actionable errors, we can also disable the e-mail notification. As I already stated before, I am currently out of options. For me, the situation is the same as if there were errors during the maintenance period.

#23 Updated by jbaier_cz 3 months ago

cdywan wrote:

Another occurrence: https://gitlab.suse.de/qa-maintenance/openQABot/-/jobs/844419

Not exactly the same, but yes.

The first one is a 500 Server Error; the second one is in fact a 503 Service Unavailable caused by restarting the machine during a maintenance window.

#24 Updated by okurz 3 months ago

jbaier_cz wrote:

okurz wrote:

That can easily be prevented within gitlab CI by not scheduling any more subsequent pipelines as long as the previous one is not yet finished.

Do you know if that is possible and how to do it? I am not aware of such a feature.

Ok, actually it's the other way around. https://docs.gitlab.com/ee/ci/pipelines/settings.html#auto-cancel-redundant-pipelines allows cancelling the old pipeline as soon as a new one starts.

Yeah, I guess that is not something that gitlab CI provides. This is the reason why I suggested that we must handle that within our scripts. We should not close this ticket until we have ensured that the pipeline only fails and subsequently alerts us if something actionable happens, i.e. if we can and should actually do something about it, unless we choose not to be informed about pipeline failures but instead point to the smelt maintainer mailing list :)

I see notifications for the failure and also for the subsequent success of the pipeline, so I know there is an error if I do not get a pipeline-succeeded e-mail after the pipeline-failed e-mail. If you only want actionable errors, we can also disable the e-mail notification. As I already stated before, I am currently out of options. For me, the situation is the same as if there were errors during the maintenance period.

Good that you mention that. I also don't want to get alerts for the SUSE IT maintenance period and I guess nobody by now expects us to resolve any such problems during that time. So here, too, we could think about what to do. We could exclude all pipeline runs during the weekly IT maintenance window or have retrying within the scripting that goes on long enough to cover the complete IT maintenance window.

This reminds me of https://github.com/okurz/leaky_bucket_error_count . How about putting a retry around the complete bot call from the outside?

#25 Updated by jbaier_cz 3 months ago

okurz wrote:

jbaier_cz wrote:

okurz wrote:

That can easily be prevented within gitlab CI by not scheduling any more subsequent pipelines as long as the previous one is not yet finished.

Do you know if that is possible and how to do it? I am not aware of such a feature.

Ok, actually it's the other way around. https://docs.gitlab.com/ee/ci/pipelines/settings.html#auto-cancel-redundant-pipelines allows cancelling the old pipeline as soon as a new one starts.

And I have a feeling it would misbehave in our setup, as we have 2 schedules which actually run the same job, so they would cancel each other.

Yeah, I guess that is not something that gitlab CI provides. This is the reason why I suggested that we must handle that within our scripts. We should not close this ticket until we have ensured that the pipeline only fails and subsequently alerts us if something actionable happens, i.e. if we can and should actually do something about it, unless we choose not to be informed about pipeline failures but instead point to the smelt maintainer mailing list :)

I see notifications for the failure and also for the subsequent success of the pipeline, so I know there is an error if I do not get a pipeline-succeeded e-mail after the pipeline-failed e-mail. If you only want actionable errors, we can also disable the e-mail notification. As I already stated before, I am currently out of options. For me, the situation is the same as if there were errors during the maintenance period.

Good that you mention that. I also don't want to get alerts for the SUSE IT maintenance period and I guess nobody by now expects us to resolve any such problems during that time. So here, too, we could think about what to do. We could exclude all pipeline runs during the weekly IT maintenance window or have retrying within the scripting that goes on long enough to cover the complete IT maintenance window.

There is one problem though: if I am not mistaken, the SMELT maintenance window is not aligned with the SUSE IT maintenance period (and I quite often see different teams having their own maintenance windows). For tools which heavily depend on querying services, we could end up excluding all workdays.

This reminds me of https://github.com/okurz/leaky_bucket_error_count . How about putting a retry around the complete bot call from the outside?

It might help in some cases, but the issue remains the same: the remote server is just down and it takes a while before it starts to answer again. I added some more retry attempts; this gives the remote side some time, as the bot can now wait for 20 minutes (total time between the first and last attempt on each request). It could, of course, bring a new problem with multiple jobs piling up in the queue, but we can solve that if and when it happens.
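As a rough illustration of how a handful of retries can add up to about 20 minutes, here is a sketch of an exponential backoff schedule. The base delay, the factor and the number of retries are assumed values chosen for the arithmetic; the ticket does not state the bot's real settings:

```python
from typing import List


def backoff_delays(retries: int, base: float = 5.0, factor: float = 2.0) -> List[float]:
    """Return the sleep time in seconds before each retry: base, base*factor, ..."""
    return [base * factor ** i for i in range(retries)]


# Assumed numbers: 8 retries starting at 5 s and doubling each time
delays = backoff_delays(8)
total_minutes = sum(delays) / 60
print(delays, total_minutes)
```

With these assumed values the delays sum to 1275 seconds, a little over 21 minutes between the first and the last attempt.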

#26 Updated by jbaier_cz 3 months ago

So far there has been no new error in the pipeline, even during the maintenance period. Unfortunately, I do not have enough data to prove whether it was the retry that helped or whether we were just lucky not to hit the unresponsive server.

#27 Updated by okurz 3 months ago

Very nice. You can resolve the ticket though, because you have at least verified that your changes haven't broken anything so far. And if it actually doesn't help, we will know from alerts.

#28 Updated by jbaier_cz 3 months ago

  • Status changed from Feedback to Resolved

#29 Updated by okurz about 2 months ago

  • Related to action #107671: No aggregate maintenance runs scheduled today on osd size:M added
