action #161309

closed

osd not accessible, 502 Bad Gateway

Added by jbaier_cz about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2024-05-31
Due date:
% Done: 0%
Estimated time:

Description

Observation

Users pointed out in https://suse.slack.com/archives/C02CANHLANP/p1717141066882819 that osd was down, and the problem was confirmed. Because nginx was complaining about its upstream, openqa-webui.service was restarted to remedy the situation. After the restart, the following error appeared in the logs:

May 31 09:39:52 openqa openqa[19648]: Error when trying to get the database version: DBIx::Class::Storage::DBI::_do_query(): DBI Exception: DBD::Pg::db do failed: ERROR:  syntax error at or near "PRAGMA"
May 31 09:39:52 openqa openqa[19648]: LINE 1: PRAGMA synchronous = OFF
May 31 09:39:52 openqa openqa[19648]:         ^ at inline delegation in DBIx::Class::DeploymentHandler::VersionStorage::Standard for version_rs->database_version (attribute declared in /usr/lib/perl5/vendor_perl/5.26.1/DBIx/Class/DeploymentHandler/VersionStorage/Standard.pm at line>

None of the service restarts helped. Because a glibc update was also involved, we decided to ensure consistency with a full system reboot, as commented in #161309-1.

A deployment, https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/1149031, was running just before the incident. The issue can already be seen in the telegraf post-deploy job:

openqa.suse.de:
    2024-05-31T07:33:42Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/jobs]: received status code 502 (Bad Gateway), expected any value out of [200]
    2024-05-31T07:33:42Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/influxdb/minion]: received status code 502 (Bad Gateway), expected any value out of [200]
    2024-05-31T07:33:45Z E! [telegraf] Error running agent: input plugins recorded 2 errors
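
For triage, the failing endpoints can be pulled out of telegraf output like the one above with a few lines of Python. This is only a minimal sketch: the regular expression is derived solely from the two `inputs.http` lines quoted in this ticket, and the names `HttpFailure`, `parse_http_failure` and `failed_urls` are hypothetical helpers, not part of telegraf or openQA.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: the line layout matches the inputs.http errors quoted in this
# ticket; other telegraf versions or plugins may format errors differently.
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) E! \[inputs\.http\] Error in plugin: "
    r"\[url=(?P<url>[^\]]+)\]: received status code (?P<status>\d+) "
    r"\((?P<reason>[^)]*)\)"
)


class HttpFailure(NamedTuple):
    """One failed HTTP check reported by telegraf (hypothetical helper type)."""
    timestamp: str
    url: str
    status: int
    reason: str


def parse_http_failure(line: str) -> Optional[HttpFailure]:
    """Return the parsed failure, or None if the line is not an inputs.http error."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return HttpFailure(
        timestamp=match["timestamp"],
        url=match["url"],
        status=int(match["status"]),
        reason=match["reason"],
    )


def failed_urls(log_text: str) -> List[HttpFailure]:
    """Collect every inputs.http failure found in a block of telegraf output."""
    failures = []
    for line in log_text.splitlines():
        failure = parse_http_failure(line)
        if failure is not None:
            failures.append(failure)
    return failures
```

Feeding the three log lines above into `failed_urls` yields one entry per failing URL; the final `[telegraf] Error running agent` summary line is skipped because it does not match the pattern.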

Rollback actions

* delete deploy freeze for today in bot-ng to re-enable pipeline scheduling DONE


Related issues 5 (1 open, 4 closed)

Related to openQA Infrastructure - action #156460: Potential FS corruption on osd due to 2 VMs accessing the same disk (Resolved, nicksinger, 2024-03-01)

Copied to openQA Infrastructure - action #161318: Ensure we have a consistent racktables entry for OSD (Resolved, okurz, 2024-05-31)

Copied to openQA Infrastructure - action #161321: OSD status overview in monitoring shows osd is online despite the whole machine being down (New, 2024-05-31)

Copied to openQA Infrastructure - action #161324: Conduct "lessons learned" with Five Why analysis for "osd not accessible, 502 Bad Gateway" (Resolved, okurz, 2024-05-31)

Copied to openQA Infrastructure - action #162332: 2024-06-15 osd not accessible size:M (Resolved, okurz)
