action #106179

closed

coordination #91646: [saga][epic] SUSE Maintenance QA workflows with fully automated testing, approval and release

coordination #109641: [epic] qem-bot improvements

No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S

Added by mgrifalconi almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Start date: 2022-02-08
Due date:
% Done: 0%
Estimated time:


Related issues 3 (0 open, 3 closed)

Related to QA (public) - coordination #106546: [epic][tools] dashboard.qem.suse.de adoption (Resolved, jbaier_cz, 2022-02-10)

Related to QA (public) - action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M (Resolved, osukup, 2022-02-22)

Related to QA (public) - action #107671: No aggregate maintenance runs scheduled today on osd size:M (Resolved, jbaier_cz)

Actions #1

Updated by mgrifalconi almost 3 years ago

  • Subject changed from No aggregate runs scheduled today - dashboard.qem.suse.de down to No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down
Actions #2

Updated by osukup almost 3 years ago

http https://dashboard.qam.suse.de/                   

http: error: SSLError: HTTPSConnectionPool(host='dashboard.qam.suse.de', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)'))) while doing a GET request to URL: https://dashboard.qam.suse.de/
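
One way to check whether the served certificate chain is complete (an incomplete chain is the usual cause of "unable to get local issuer certificate") is to inspect it directly with openssl, sketched here against the same host:

# print the subject, issuer and validity of the certificate served by the dashboard;
# "unable to get local issuer certificate" usually means the intermediate CA is missing from the chain
openssl s_client -connect dashboard.qam.suse.de:443 -servername dashboard.qam.suse.de </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates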
Actions #3

Updated by okurz almost 3 years ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #4

Updated by okurz almost 3 years ago

  • Project changed from openQA Project (public) to QA (public)
  • Category deleted (Regressions/Crashes)
Actions #5

Updated by osukup almost 3 years ago

dehydrated works as expected and nginx works without problems, but from error.log:

2022/02/07 23:08:07 [error] 25308#25308: *18791264 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:08:37 [error] 25308#25308: *18820037 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:09:07 [error] 25308#25308: *18820039 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
2022/02/07 23:09:37 [error] 25308#25308: *18791264 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.100.51.185, server: dashboard.qam.suse.de, request: "GET /app/api/blocked HTTP/1.1", upstream: "http://127.0.0.1:4000/app/api/blocked", host: "dashboard.qam.suse.de", referrer: "http://dashboard.qam.suse.de/blocked"
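
The timeouts above come from the backend on 127.0.0.1:4000, not from nginx itself; a quick way to confirm that (a sketch, assuming shell access on the host running nginx) is to query the upstream directly and see whether it also hangs:

# query the dashboard backend directly, bypassing nginx;
# if this also times out, the problem is in the application or its database, not in nginx
curl -sS --max-time 10 -o /dev/null -w 'HTTP %{http_code} after %{time_total}s\n' http://127.0.0.1:4000/app/api/blocked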
Actions #6

Updated by osukup almost 3 years ago

broken connection between dashboard service and postgresql database ...

Actions #7

Updated by mkittler almost 3 years ago

When just opening https://dashboard.qam.suse.de I get a 500 error for AJAX queries.

Actions #8

Updated by osukup almost 3 years ago

  • Status changed from New to Feedback
  • Assignee set to osukup

Fixed. It looks like the IP of the postgresql container changed, so I fixed dashboard.yml with the new correct IP.

Actions #9

Updated by mgrifalconi almost 3 years ago

Great, thank you!
Will today's tests be scheduled now, or are they lost and will only run tomorrow?

Actions #10

Updated by kraih almost 3 years ago

osukup wrote:

Fixed. It looks like the IP of the postgresql container changed, so I fixed dashboard.yml with the new correct IP.

This is the second time we are having trouble with the postgres container. That really shouldn't happen. Can we maybe move postgres and deploy it without a container?

Actions #11

Updated by okurz almost 3 years ago

  • Priority changed from Immediate to Urgent

osukup wrote:

Fixed. It looks like the IP of the postgresql container changed, so I fixed dashboard.yml with the new correct IP.

Please provide more details here, e.g. reference the git commit containing the dashboard.yml change. And why do we need to manually maintain IP addresses? Lowering priority after the urgency of "immediate" was addressed.

Actions #12

Updated by okurz almost 3 years ago

  • Due date set to 2022-02-22
Actions #13

Updated by kraih almost 3 years ago

We just had more problems with the postgres container becoming unreachable.

Feb 08 12:15:59 qam2 dashboard[6292]: [6292] [e] [Qys9dKEf7610] DBI connect('dbname=dashboard_db;host=192.168.0.48;port=5432','dashboard_user',...) failed: connection to server at "192.168.0.48", port 5432 failed: No route to host
Feb 08 12:15:59 qam2 dashboard[6292]:         Is the server running on that host and accepting TCP/IP connections? at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Pg.pm line 73.
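
For a "No route to host" like the above, the failure can be narrowed down from the dashboard host with standard tools (a sketch; host, port, database and user are taken from the error message):

# is the container's address reachable at all?
ping -c 1 192.168.0.48
# does postgres accept connections on that address and port?
pg_isready -h 192.168.0.48 -p 5432 -d dashboard_db -U dashboard_user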
Actions #14

Updated by jbaier_cz almost 3 years ago

kraih wrote:

This is the second time we are having trouble with the postgres container. That really shouldn't happen. Can we maybe move postgres and deploy it without a container?

Nope, this is the second time we are having trouble with the networking (the first problem was caused by an incorrect setting in the main DHCP and wicked being weird). Running postgres outside a container does not make it magically better. The solution could be to use postgres from dbproxy.suse.de (infra managed, clustered instance).

okurz wrote:

osukup wrote:

Fixed. It looks like the IP of the postgresql container changed, so I fixed dashboard.yml with the new correct IP.

Please provide more details here, e.g. reference the git commit containing the dashboard.yml change. And why do we need to manually maintain IP addresses? Lowering priority after the urgency of "immediate" was addressed.

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where the dashboard is running.

It is a bit of a mystery why the IP changed, as there is a static lease. The dnsmasq configuration was enhanced to include the MAC address, and after restarting the networking in the container, the correct address was offered and accepted.
As a follow-up, we could reconfigure the container with a static network configuration instead of the current DHCP setup (although there is nothing wrong with the current setup).
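
For illustration, such a static dnsmasq lease pinned to the container's MAC address looks roughly like the following (MAC address, name and IP are placeholders, not the real values):

# static lease for the database container, keyed on its MAC address (placeholder values)
echo 'dhcp-host=52:54:00:aa:bb:cc,dashboard-db,192.168.0.10' >> /etc/dnsmasq.conf
systemctl restart dnsmasq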

kraih wrote:

We just had more problems with the postgres container becoming unreachable.

Feb 08 12:15:59 qam2 dashboard[6292]: [6292] [e] [Qys9dKEf7610] DBI connect('dbname=dashboard_db;host=192.168.0.48;port=5432','dashboard_user',...) failed: connection to server at "192.168.0.48", port 5432 failed: No route to host
Feb 08 12:15:59 qam2 dashboard[6292]:         Is the server running on that host and accepting TCP/IP connections? at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Pg.pm line 73.

And that was me, fixing the lease and changing the configuration back as the .48 address was totally random and dynamic.

Actions #15

Updated by kraih almost 3 years ago

jbaier_cz wrote:

Running postgres outside a container does not make it magically better. The solution could be to use postgres from dbproxy.suse.de (infra managed, clustered instance).

Actually it would; outside the container it could just use a UNIX domain socket to connect to postgres. No dependence on internal networking. We have postgres deployed that way on countless machines, and they are usually rock solid.
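
For comparison, a peer-authenticated connection over the local UNIX socket needs no IP address at all; a minimal sketch with libpq-style parameters (the socket directory is the usual default and may differ, database and user as in the error above):

# connect over the local UNIX socket instead of TCP, so there is nothing to break when container IPs change
psql "host=/var/run/postgresql dbname=dashboard_db user=dashboard_user"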

Actions #16

Updated by jbaier_cz almost 3 years ago

kraih wrote:

Actually it would; outside the container it could just use a UNIX domain socket to connect to postgres. No dependence on internal networking. We have postgres deployed that way on countless machines, and they are usually rock solid.

That's true, but only if you also run all the services which use the database from the same machine (which was not the case on qam2).

Actions #17

Updated by mgrifalconi almost 3 years ago

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where the dashboard is running.

Maybe a private git/gitlab repo? Could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.
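
One hypothetical shape of that idea (the path, variable name and service name below are made up for illustration, and it assumes the application can read its secrets from the environment at all): keep only the secrets in a root-only environment file loaded by the systemd unit, and track everything else in git.

# root-only file holding just the secrets, outside of git (hypothetical path and variable name)
install -m 600 /dev/null /etc/dashboard/secrets.env
echo 'DASHBOARD_DB_PASSWORD=changeme' >> /etc/dashboard/secrets.env
# load it from a drop-in of the (assumed) dashboard.service unit:
#   [Service]
#   EnvironmentFile=/etc/dashboard/secrets.env
systemctl edit dashboard.service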

Actions #18

Updated by osukup almost 3 years ago

For today's issue --> probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own postgresql container to an infra-managed database, new poo?

Actions #19

Updated by jbaier_cz almost 3 years ago

mgrifalconi wrote:

Maybe a private git/gitlab repo? Could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.

Just a side note, if you keep private stuff in env variables you will lose the option to track the changes and restore them. And in this case, the configuration is nothing but a few secret tokens. A template language is overkill.

osukup wrote:

For today's issue --> probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own postgresql container to an infra-managed database, new poo?

Totally agree, the fewer things to manage the better. The server was never meant to run any database anyway.

Actions #20

Updated by kraih almost 3 years ago

osukup wrote:

For today's issue --> probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own postgresql container to an infra-managed database, new poo?

Totally agree, the fewer things to manage the better. The server was never meant to run any database anyway.

Then let's do that.

Actions #21

Updated by okurz almost 3 years ago

osukup wrote:

For today's issue --> probably the best solution is to close this poo (as the subject is about today's outage).

For the future we should consider moving from our own postgresql container to an infra-managed database, new poo?

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

mgrifalconi wrote:

The configuration is not in git (sensitive material inside), so the change needs to be done directly on the machine where the dashboard is running.

Maybe a private git/gitlab repo? Could be useful to be able to track all changes in the config and easily restore them.
Passwords could still be kept out as env variables, using some templating language, etc.

I also don't understand how the configuration can not be stored in git.

Actions #22

Updated by jbaier_cz almost 3 years ago

okurz wrote:

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

Just a nitpick: this is technically not a regression as we didn't change anything in the first place; it can be qualified as a bigger issue though. If we are proceeding with improvements I would also like to finally tweak the deployment process, as that came up as a quick improvisation.

I also don't understand how the configuration can not be stored in git.

I think we should be more specific. There are several things which are mixed together and I have a feeling we are not understanding each other.

  • I do not think it is a good idea to have deployment specific configuration (database address, username / password) in the same repository with the code (because then you cannot deploy multiple copies, ...)
  • I agree that the configuration should be in some sort of git; that is the reason the deployment is done by ansible and part of the configuration is actually already there (at least the service files, user/group settings and such), so it is in the git (just without the passwords for now)
  • correct me if I am wrong, but the dashboard is still more of a proof-of-concept than production-grade software, so we really did not have any time to tweak the accompanying processes (hence it is just deployed somewhere with some database)
My proposal
  1. Migrate database to a proper database cluster, EngInfra maintained if possible -- dbproxy.suse.de (as an alternative, share with OSD?)
  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with the bot-ng, but that is no longer true)
  3. Document where one needs to create a PR to change the deployment configuration and make sure people do not just "change that in production" (one possible shape is sketched below)
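
Since the deployment is already done by ansible, one way to cover point 3 could be to keep the sensitive part of the configuration encrypted in the same repository with ansible-vault, so that changes go through normal PRs (a sketch only; the file path is hypothetical, not the current layout):

# encrypt the secret part of the deployment configuration so it can live in git
ansible-vault encrypt roles/dashboard/vars/secrets.yml
# later changes are made through the same file and reviewed as a PR
ansible-vault edit roles/dashboard/vars/secrets.yml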
Actions #23

Updated by okurz almost 3 years ago

jbaier_cz wrote:

okurz wrote:

No, it's fine. We should keep this ticket. Also see https://progress.opensuse.org/projects/qa/wiki/tools#How-we-work-on-our-backlog about improvements on regressions.

Just a nitpick: this is technically not a regression as we didn't change anything in the first place; it can be qualified as a bigger issue though. If we are proceeding with improvements I would also like to finally tweak the deployment process, as that came up as a quick improvisation.

Yes, I agree. We are not talking about fixing the original observation but improving with "feature work".

I also don't understand how the configuration can not be stored in git.

I think we should be more specific. There are several things which are mixed together and I have a feeling we are not understanding each other.

  • I do not think it is a good idea to have deployment specific configuration (database address, username / password) in the same repository with the code (because then you cannot deploy multiple copies, ...)

+1

  • I agree that the configuration should be in some sort of git; that is the reason the deployment is done by ansible and part of the configuration is actually already there (at least the service files, user/group settings and such), so it is in the git (just without the passwords for now)

+1

  • correct me if I am wrong, but the dashboard is still more of a proof-of-concept than production-grade software, so we really did not have any time to tweak the accompanying processes (hence it is just deployed somewhere with some database)

+1 as well. I am most concerned with how others became reliant on qem-dashboard+qem-bot. There was a decision conducted by some in 2021-08 to delete code from qa-maintenance/openQABot so that now only qem-bot can handle incidents and aggregate tests, which effectively makes qem-dashboard+qem-bot a critical component even though we may not like that.

My proposal
  1. Migrate database to a proper database cluster, EngInfra maintained if possible -- dbproxy.suse.de (as an alternative, share with OSD?)

I understood that the database content can be considered transient and is effectively recreated automatically, so I would not include it in a database cluster that treats the included data with a more expensive high-redundancy and backup process. And I don't think the data should be shared with OSD unless we actually merge what qem-dashboard provides into openQA itself.

  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with the bot-ng, but that is no longer true)

Can you just share for our readers a little bit more what the current setup looks like, please? :)

  3. Document where one needs to create a PR to change the deployment configuration and make sure people do not just "change that in production"

+1

Actions #24

Updated by livdywan almost 3 years ago

  • Subject changed from No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down to No aggregate maintenance runs scheduled today on osd - dashboard.qem.suse.de down size:S
  • Description updated (diff)
Actions #25

Updated by jbaier_cz almost 3 years ago

Actions #26

Updated by jbaier_cz almost 3 years ago

  • Status changed from Feedback to Resolved

okurz wrote:

  2. Find a better place to deploy the application itself, if there should be any guarantees about availability (it had some meaning historically as it shared the machine with the bot-ng, but that is no longer true)

Can you just share for our readers a little bit more what the current setup looks like, please? :)

I created a new epic to coordinate, will comment there.

This particular issue was solved.

Actions #27

Updated by okurz almost 3 years ago

  • Related to action #107227: bot-ng schedule aborted with "ERROR: something wrong with /etc/openqabot/singlearch.yml" size:M added
Actions #28

Updated by okurz almost 3 years ago

  • Related to action #107671: No aggregate maintenance runs scheduled today on osd size:M added
Actions #29

Updated by okurz over 2 years ago

  • Parent task set to #91646
Actions #30

Updated by okurz over 2 years ago

  • Parent task changed from #91646 to #109641
Actions #31

Updated by okurz over 2 years ago

  • Due date deleted (2022-02-22)