action #106179

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #109641: [epic] qem-bot improvements

No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S

Added by mgrifalconi 5 months ago. Updated 2 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version:
Start date: 2022-02-08
Due date:
% Done: 0%
Estimated time:

Related issues

Related to QA - coordination #106546: [epic][tools] dashboard.qem.suse.de adoption (New, 2022-02-10)

Related to QA - action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M (Resolved, 2022-02-22)

Related to QA - action #107671: No aggregate maintenance runs scheduled today on osd size:M (Resolved)

History

#1 Updated by mgrifalconi 5 months ago

  • Subject changed from No aggregate runs scheduled today - dashboard.qem.suse.de down to No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down

#2 Updated by osukup 5 months ago

http https://dashboard.qam.suse.de/                   

http: error: SSLError: HTTPSConnectionPool(host='dashboard.qam.suse.de', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)'))) while doing a GET request to URL: https://dashboard.qam.suse.de/

#3 Updated by okurz 5 months ago

  • Category set to Concrete Bugs
  • Target version set to Ready

#4 Updated by okurz 5 months ago

  • Project changed from openQA Project to QA
  • Category deleted (Concrete Bugs)

#5 Updated by osukup 5 months ago

dehydrated works as expected and nginx works without problems, but from error.log:

2022/02/07 23:08:07 [error] 25308#25308: *18791264 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:08:37 [error] 25308#25308: *18820037 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:09:07 [error] 25308#25308: *18820039 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:09:37 [error] 25308#25308: *18791264 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
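The repeated upstream timeouts above mean nginx reached the dashboard backend on 127.0.0.1:4000 but never received a response header, i.e. the application itself was hanging (as it turned out, on the database). A minimal sketch of the reverse-proxy block implied by those log lines; the certificate paths and the timeout value are assumptions, not the production configuration:

```nginx
server {
    listen 443 ssl;
    server_name dashboard.qam.suse.de;

    # Certificates are renewed by dehydrated; these paths are assumed.
    ssl_certificate     /etc/ssl/dashboard/fullchain.pem;
    ssl_certificate_key /etc/ssl/dashboard/privkey.pem;

    location / {
        # Matches the "upstream: http://127.0.0.1:4000/..." in the error log.
        proxy_pass http://127.0.0.1:4000;
        # The ~30 s spacing of the log entries suggests a 30 s read timeout
        # (nginx's default would be 60 s); this value is an assumption.
        proxy_read_timeout 30s;
    }
}
```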

#6 Updated by osukup 5 months ago

Broken connection between the dashboard service and the PostgreSQL database.
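A quick way to confirm this kind of failure from the dashboard host is a plain TCP reachability check against the database endpoint, before digging into the application. A minimal sketch in Python; `can_connect` is a hypothetical helper, not part of the dashboard code, and the host/port in the usage comment are whatever dashboard.yml currently points at:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", "Connection refused" and timeouts alike.
        return False

# Example: check the PostgreSQL container the dashboard is configured to use,
# e.g. can_connect("192.168.0.17", 5432)
```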

#7 Updated by mkittler 5 months ago

When just opening https://dashboard.qam.suse.de I get a 500 error for AJAX queries.

#8 Updated by osukup 5 months ago

  • Status changed from New to Feedback
  • Assignee set to osukup

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

#9 Updated by mgrifalconi 5 months ago

Great, thank you!
Will today's tests be scheduled now, or are they lost and will only run tomorrow?

#10 Updated by kraih 5 months ago

osukup wrote:

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

This is the second time we are having trouble with the postgres container. That really shouldn't happen. Can we maybe move postgres and deploy it without a container?

#11 Updated by okurz 5 months ago

  • Priority changed from Immediate to Urgent

osukup wrote:

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

Please provide more details here, e.g. reference the git commit containing the dashboard.yml change. And why do we need to manually maintain IP addresses? Lowering the priority now that the "Immediate" urgency has been addressed.

#12 Updated by okurz 5 months ago

  • Due date set to 2022-02-22

#13 Updated by kraih 5 months ago

We just had more problems with the postgres container becoming unreachable.

Feb 08 12:15:59 qam2 dashboard[6292]: [6292] [e] [Qys9dKEf7610] DBI connect('dbname=dashboard_db;host=192.168.0.48;port=5432','dashboard_user',...) failed: connection to server at "192.168.0.48", port 5432 failed: No route to host
Feb 08 12:15:59 qam2 dashboard[6292]:         Is the server running on that host and accepting TCP/IP connections? at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Pg.pm line 73.

#14 Updated by jbaier_cz 5 months ago

kraih wrote:

This is the second time we are having trouble with the postgres container. That really shouldn't happen. Can we maybe move postgres and deploy it without a container?

Nope, this is the second time we are having trouble with the networking (the first problem was caused by an incorrect setting in the main DHCP and wicked being weird). Running postgres outside a container does not make it magically better. The solution could be to use postgres from dbproxy.suse.de (an infra-managed, clustered instance).

okurz wrote:

osukup wrote:

Fixed. It looks like the IP of the PostgreSQL container changed, so I fixed dashboard.yml with the new correct IP.

Please provide more details here, e.g. reference the git commit containing the dashboard.yml change. And why do we need to manually maintain IP addresses? Lowering the priority now that the "Immediate" urgency has been addressed.

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where the dashboard is running.

It is a little bit of a mystery why the IP changed, as there is a static lease. The configuration for dnsmasq was enhanced to include the MAC address, and after restarting the networking in the container, the correct address was offered and accepted.
As a follow-up, we could reconfigure the container to use a static network configuration instead of the current DHCP setup (although there is nothing wrong with the current setup).
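For reference, pinning a lease to the container's MAC address in dnsmasq looks roughly like this; the MAC, IP, and name below are made up for illustration, not the real values:

```
# /etc/dnsmasq.conf (or a snippet under /etc/dnsmasq.d/)
# Always hand this MAC the same address, independent of the client-id:
dhcp-host=52:54:00:aa:bb:cc,192.168.0.17,dashboard-db
```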

kraih wrote:

We just had more problems with the postgres container becoming unreachable.

Feb 08 12:15:59 qam2 dashboard[6292]: [6292] [e] [Qys9dKEf7610] DBI connect('dbname=dashboard_db;host=192.168.0.48;port=5432','dashboard_user',...) failed: connection to server at "192.168.0.48", port 5432 failed: No route to host
Feb 08 12:15:59 qam2 dashboard[6292]:         Is the server running on that host and accepting TCP/IP connections? at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Pg.pm line 73.

And that was me, fixing the lease and changing the configuration back, as the .48 address was totally random and dynamic.

#15 Updated by kraih 5 months ago

jbaier_cz wrote:

Running postgres outside a container does not make it magically better. The solution could be to use postgres from dbproxy.suse.de (an infra-managed, clustered instance).

Actually it would, outside the container it could just use a UNIX domain socket to connect to postgres. No dependence on internal networking. We have postgres deployed that way on countless machines, and they are usually rock solid.
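The difference would only be in the connection string the dashboard uses. A sketch in libpq-style URLs (database and user names are taken from the log excerpt above; the password and socket directory are assumptions):

```
# TCP to the container -- depends on the container's IP staying stable:
postgresql://dashboard_user:secret@192.168.0.48:5432/dashboard_db

# Local UNIX domain socket -- no IP addresses involved at all:
postgresql:///dashboard_db?host=/var/run/postgresql
```

In the socket form, `host` points at the directory containing the PostgreSQL socket file, so container networking never enters the picture.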

#16 Updated by jbaier_cz 5 months ago

kraih wrote:

Actually it would, outside the container it could just use a UNIX domain socket to connect to postgres. No dependence on internal networking. We have postgres deployed that way on countless machines, and they are usually rock solid.

That's true, but only if you also run all the services which use the database on the same machine (which was not the case on qam2).

#17 Updated by mgrifalconi 5 months ago

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where dashboard is running.

Maybe a private git/gitlab repo? It could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.
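One common shape for that idea is to keep the non-secret configuration in git and read only the secrets from the environment at startup. A minimal sketch in Python; the variable name `DASHBOARD_PG_URL` and the default value are made up for illustration, not part of the actual dashboard:

```python
import os

def database_url():
    """Return the database URL, preferring the (secret) environment value.

    The fallback below contains no credentials and is safe to keep in git.
    """
    return os.environ.get("DASHBOARD_PG_URL",
                          "postgresql://localhost/dashboard_db")
```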

#18 Updated by osukup 5 months ago

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

#19 Updated by jbaier_cz 5 months ago

mgrifalconi wrote:

Maybe a private git/gitlab repo? It could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.

Just a side note: if you keep private stuff in env variables you will lose the option to track the changes and restore them. And in this case, the configuration is nothing but a few secret tokens. A template language is overkill.

osukup wrote:

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

Totally agree, the fewer things to manage, the better. The server was never meant to run any database anyway.

#20 Updated by kraih 5 months ago

osukup wrote:

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

Totally agree, the fewer things to manage, the better. The server was never meant to run any database anyway.

Then let's do that.

#21 Updated by okurz 5 months ago

osukup wrote:

For today's issue, probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own PostgreSQL container to an infra-managed database. New poo?

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

mgrifalconi wrote:

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where the dashboard is running.

Maybe a private git/gitlab repo? It could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.

I also don't understand how the configuration can not be stored in git.

#22 Updated by jbaier_cz 5 months ago

okurz wrote:

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

Just a nitpick: this is technically not a regression, as we didn't change anything in the first place; it can be qualified as a bigger issue, though. If we are proceeding with improvements, I would also like to finally tweak the deployment process, as that came about as a quick improvisation.

I also don't understand how the configuration can not be stored in git.

I think we should be more specific. There are several things which are mixed together and I have a feeling we are not understanding each other.

  • I do not think it is a good idea to have deployment-specific configuration (database address, username/password) in the same repository as the code (because then you cannot deploy multiple copies, ...)
  • I agree that the configuration should be in some sort of git; that is the reason the deployment is done by ansible, and part of the configuration is actually already there (at least the service files, user/group settings and such), so it is in git (just without the passwords for now)
  • Correct me if I am wrong, but the dashboard is still more of a proof-of-concept than production-grade software, so we really did not have any time to tweak the accompanying processes (hence it is just deployed somewhere with some database)
My proposal:
  1. Migrate the database to a proper database cluster, EngInfra-maintained if possible -- dbproxy.suse.de (as an alternative, share with OSD?)
  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with bot-ng, but that is no longer true)
  3. Document where one needs to create a PR to change the deployment configuration, and make sure people do not just "change that in production"

#23 Updated by okurz 5 months ago

jbaier_cz wrote:

okurz wrote:

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

Just a nitpick: this is technically not a regression, as we didn't change anything in the first place; it can be qualified as a bigger issue, though. If we are proceeding with improvements, I would also like to finally tweak the deployment process, as that came about as a quick improvisation.

Yes, I agree. We are not talking about fixing the original observation but improving with "feature work".

I also don't understand how the configuration can not be stored in git.

I think we should be more specific. There are several things which are mixed together and I have a feeling we are not understanding each other.

  • I do not think it is a good idea to have deployment-specific configuration (database address, username/password) in the same repository as the code (because then you cannot deploy multiple copies, ...)

+1

  • I agree that the configuration should be in some sort of git; that is the reason the deployment is done by ansible, and part of the configuration is actually already there (at least the service files, user/group settings and such), so it is in git (just without the passwords for now)

+1

  • Correct me if I am wrong, but the dashboard is still more of a proof-of-concept than production-grade software, so we really did not have any time to tweak the accompanying processes (hence it is just deployed somewhere with some database)

+1 as well. I am most concerned with how others became reliant on qem-dashboard+qem-bot. A decision was made by some in 2021-08 to delete code from qa-maintenance/openQABot, so that now only qem-bot can handle incident and aggregate tests, effectively making qem-dashboard+qem-bot a critical component even though we may not like that.

My proposal:
  1. Migrate the database to a proper database cluster, EngInfra-maintained if possible -- dbproxy.suse.de (as an alternative, share with OSD?)

I understood that the database content can be considered transient and is effectively recreated automatically, so I would not include it in a database cluster that treats the included data with a more expensive high-redundancy and backup process. And I don't think the data should be shared with OSD unless we merge what qem-dashboard provides into openQA itself.

  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with bot-ng, but that is no longer true)

Can you share with our readers a little bit more about what the current setup looks like, please? :)

  3. Document where one needs to create a PR to change the deployment configuration, and make sure people do not just "change that in production"

+1

#24 Updated by cdywan 5 months ago

  • Subject changed from No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down to No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S
  • Description updated (diff)

#25 Updated by jbaier_cz 5 months ago

#26 Updated by jbaier_cz 5 months ago

  • Status changed from Feedback to Resolved

okurz wrote:

  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with bot-ng, but that is no longer true)

Can you share with our readers a little bit more about what the current setup looks like, please? :)

I created a new epic to coordinate this and will comment there.

This particular issue was solved.

#27 Updated by okurz 4 months ago

  • Related to action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M added

#28 Updated by okurz 4 months ago

  • Related to action #107671: No aggregate maintenance runs scheduled today on osd size:M added

#29 Updated by okurz 3 months ago

  • Parent task set to #91646

#30 Updated by okurz 3 months ago

  • Parent task changed from #91646 to #109641

#31 Updated by okurz 2 months ago

  • Due date deleted (2022-02-22)
