action #88439

[sporadic][qe-core] test fails in php7_postgresql

Added by ggardet_arm 8 months ago. Updated 10 days ago.

Status:
Blocked
Priority:
Normal
Assignee:
-
Category:
Bugs in existing tests
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Tags:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-JeOS-for-AArch64-arm-jeos@aarch32-HD24G fails in php7_postgresql

EDIT: The following error from the logs seems to be non-fatal:

# wait_serial expected: "echo -e \"<?php\nphpinfo()\n?>\" > /srv/www/htdocs/index.php; echo lPSng-\$?-"
# Result:
echo -e "<?php
> phpinfo()
> ?>" > /srv/www/htdocs/index.php; echo lPSng-$?-

But it fails later with pg_ctl -D /tmp/psql status showing that the server is not started yet.
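
The failure mode described above (a status check that runs before the server is up) is commonly handled by polling instead of checking once. The following is a minimal sketch under stated assumptions, not what the test module actually does: the `wait_for_postgres` helper, the 120 s timeout and the 2 s interval are hypothetical. It relies only on the documented behaviour that `pg_ctl status` exits with 0 when the server is running and non-zero (3) when it is not.

```python
import subprocess
import time


def pg_ctl_status(data_dir):
    """Return the exit code of `pg_ctl -D <data_dir> status` (0 means running)."""
    return subprocess.run(
        ["pg_ctl", "-D", data_dir, "status"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode


def wait_for_postgres(data_dir, timeout=120.0, interval=2.0,
                      status=pg_ctl_status, sleep=time.sleep,
                      clock=time.monotonic):
    """Poll the server status until it reports running or the timeout expires.

    timeout and interval are hypothetical values chosen for illustration.
    status, sleep and clock are injectable so the loop can be tested
    without a real PostgreSQL installation.
    """
    deadline = clock() + timeout
    while True:
        if status(data_dir) == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example call, using the data directory from the description:
#   wait_for_postgres("/tmp/psql")
```

Polling with a deadline keeps a slow start on a loaded aarch64 or ppc64le worker from being reported as a failure, while a server that never starts still fails after a bounded time.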

Test suite description

Maintainer: fvogt, mnowak

Start JeOS from the HDD image, configure it using the firstboot wizard and then run basic tests. console=tty0 added as needed for aarch64.

Reproducible

Fails since (at least) Build 20210121

Expected result

Last good: 20210118 (or more recent)

Further details

Always latest result in this scenario: latest

History

#1 Updated by ggardet_arm 8 months ago

  • Subject changed from test fails in php7_postgresql to [sporadic] test fails in php7_postgresql

#2 Updated by ggardet_arm 8 months ago

  • Description updated (diff)

#3 Updated by okurz 7 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos@
https://openqa.opensuse.org/tests/1625743

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed

#4 Updated by okurz 6 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode
https://openqa.opensuse.org/tests/1660783


#5 Updated by okurz 6 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos@
https://openqa.opensuse.org/tests/1676318


#6 Updated by michel_mno 6 months ago

Similar problem on ppc64/ppc64le.
Created a new PR: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/12193

#7 Updated by okurz 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos@
https://openqa.opensuse.org/tests/1693356


#8 Updated by michel_mno 5 months ago

Despite PR #12193 having been merged since 20210401, there are still sporadic failures, e.g. on ppc64le:
https://openqa.opensuse.org/tests/1700104#step/php7_postgresql/41

#9 Updated by openqa_review 5 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos@
https://openqa.opensuse.org/tests/1693356


#10 Updated by openqa_review 4 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos@
https://openqa.opensuse.org/tests/1732252


#11 Updated by okurz 4 months ago

  • Subject changed from [sporadic] test fails in php7_postgresql to [sporadic][tools] test fails in php7_postgresql
  • Assignee set to osukup
  • Target version set to Ready

osukup, you are the test module maintainer, so I am assigning this to you.

tjyrinki_suse, if we within the tools team, with the ticket in osukup's hands, are not able to solve this, I would suggest that QE-Core take over. This ticket seems to have been overlooked, but as can be seen in the reminder comments there are multiple recurring problems.

#12 Updated by tjyrinki_suse 4 months ago

  • Tags set to jeos

#13 Updated by osukup 4 months ago

Looks more like a bug in postgres or in the postgres systemd unit:

Apr 08 08:31:48.405373 ip-10-0-2-15 systemd[1]: Starting PostgreSQL database server...
Apr 08 08:31:50.789778 ip-10-0-2-15 postgresql-script[18267]: Initializing PostgreSQL 13.2 at location /var/lib/pgsql/data
Apr 08 08:33:18.308245 ip-10-0-2-15 systemd[1]: postgresql.service: start operation timed out. Terminating.
Apr 08 08:34:31.050131 ip-10-0-2-15 wickedd-dhcp6[993]: enp0s3: DHCPv6 is disabled by IPv6 router RA
Apr 08 08:34:31.071432 ip-10-0-2-15 wickedd-dhcp6[993]: enp0s3: DHCPv6 is disabled by IPv6 router RA
Apr 08 08:34:48.560094 ip-10-0-2-15 systemd[1]: postgresql.service: State 'final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Apr 08 08:34:48.562823 ip-10-0-2-15 systemd[1]: postgresql.service: Failed with result 'timeout'.
Apr 08 08:34:48.572234 ip-10-0-2-15 systemd[1]: postgresql.service: Unit process 18287 (postgres) remains running after unit stopped.
Apr 08 08:34:48.696912 ip-10-0-2-15 systemd[1]: Failed to start PostgreSQL database server.
Apr 08 08:34:51.969434 ip-10-0-2-15 systemd[1]: Started Getty on tty5.

and later in journal:

Apr 08 08:35:31.302254 ip-10-0-2-15 kernel: task:postgres state:S stack: 0 pid:18287 ppid: 1 flags:0x00000000
Apr 08 08:35:31.306012 ip-10-0-2-15 kernel: Call trace:
Apr 08 08:35:31.309892 ip-10-0-2-15 kernel: __switch_to+0x10c/0x170
Apr 08 08:35:31.312496 ip-10-0-2-15 kernel: __schedule+0x324/0x990
Apr 08 08:35:31.315047 ip-10-0-2-15 kernel: schedule+0x54/0xdc
Apr 08 08:35:31.317595 ip-10-0-2-15 kernel: futex_wait_queue_me+0xbc/0x14c
Apr 08 08:35:31.321044 ip-10-0-2-15 kernel: futex_wait+0xe4/0x240
Apr 08 08:35:31.323562 ip-10-0-2-15 kernel: do_futex+0x140/0xd7c
Apr 08 08:35:31.327187 ip-10-0-2-15 kernel: __arm64_sys_futex+0x11c/0x194
Apr 08 08:35:31.329999 ip-10-0-2-15 kernel: el0_svc_common.constprop.0+0x84/0x210
Apr 08 08:35:31.332533 ip-10-0-2-15 kernel: do_el0_svc+0x80/0xa0
Apr 08 08:35:31.335059 ip-10-0-2-15 kernel: el0_svc+0x28/0x70
Apr 08 08:35:31.337582 ip-10-0-2-15 kernel: el0_sync_handler+0x1a4/0x1b0
Apr 08 08:35:31.340430 ip-10-0-2-15 kernel: el0_sync+0x178/0x180

#14 Updated by okurz 4 months ago

Hm, the call trace looks more like the system is just very busy, so that the postgres service startup eventually times out. OK, so what do you plan to do?
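
The timing in the journal excerpt from note 13 can be checked with a short sketch. It parses the timestamps of the "Starting", "start operation timed out" and "'final-sigterm' timed out" lines and prints the elapsed time between them. Both intervals come out at roughly 90 seconds, which matches systemd's stock default start and stop timeouts of 90 s. Whether postgresql.service overrides those defaults is not shown in this ticket, so that part is an assumption. The `parse_ts` helper is hypothetical.

```python
from datetime import datetime

# Journal timestamps here carry no year; strptime then defaults to 1900,
# which does not matter for differences within the same day.
FMT = "%b %d %H:%M:%S.%f"


def parse_ts(line):
    """Parse the leading 'Mon DD HH:MM:SS.ffffff' timestamp of a journal line."""
    return datetime.strptime(" ".join(line.split()[:3]), FMT)


start = parse_ts(
    "Apr 08 08:31:48.405373 ip-10-0-2-15 systemd[1]: "
    "Starting PostgreSQL database server..."
)
start_timeout = parse_ts(
    "Apr 08 08:33:18.308245 ip-10-0-2-15 systemd[1]: "
    "postgresql.service: start operation timed out. Terminating."
)
stop_timeout = parse_ts(
    "Apr 08 08:34:48.560094 ip-10-0-2-15 systemd[1]: "
    "postgresql.service: State 'final-sigterm' timed out. "
    "Skipping SIGKILL. Entering failed mode."
)

start_elapsed = (start_timeout - start).total_seconds()
stop_elapsed = (stop_timeout - start_timeout).total_seconds()

print(f"start phase: {start_elapsed:.1f} s")
print(f"stop phase:  {stop_elapsed:.1f} s")
```

If the service simply needs more time on a busy worker, a longer start timeout for the unit would be one thing to try. That is a suggestion for investigation, not a confirmed fix.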

#15 Updated by okurz 4 months ago

  • Status changed from New to Workable

"Workable", because for "Bugs in existing tests" the ACs should be implicit: as soon as the tests work stably again we are good :)

#16 Updated by osukup 4 months ago

  • Status changed from Workable to Blocked

#17 Updated by cdywan 4 months ago

Some notes from our call:

  • osukup triaged the bug
  • https://openqa.opensuse.org/tests/1732252
  • the test gets stuck at start-up, service can't get killed because postgres is in zombie state
  • this only happens on arm and ppc
  • hypothesis: original test doesn't finish cleanly, product bug
  • see BUG
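
For reference when reading the kernel task dump in note 13: the dump line carries a one-letter task state, and the line quoted there shows `state:S`. A minimal sketch for decoding such lines follows. The `decode_task_line` helper and the regular expression are hypothetical; the letter meanings are the usual Linux task state codes.

```python
import re

# Usual Linux one-letter task state codes.
TASK_STATES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "t": "tracing stop",
    "Z": "zombie",
    "X": "dead",
    "I": "idle",
}

# Matches e.g. "task:postgres state:S stack: 0 pid:18287 ppid: 1 ...".
# The word boundary before "pid:" keeps "ppid:" from matching.
TASK_LINE = re.compile(
    r"task:(?P<comm>\S+)\s+state:(?P<state>\S)\s.*?\bpid:\s*(?P<pid>\d+)"
)


def decode_task_line(line):
    """Return (command, pid, state description) or None if the line does not match."""
    match = TASK_LINE.search(line)
    if match is None:
        return None
    state = TASK_STATES.get(match["state"], "unknown")
    return match["comm"], int(match["pid"]), state


line = (
    "Apr 08 08:35:31.302254 ip-10-0-2-15 kernel: task:postgres state:S "
    "stack: 0 pid:18287 ppid: 1 flags:0x00000000"
)
print(decode_task_line(line))
```

For the quoted line this reports pid 18287 in interruptible sleep, which fits the `futex_wait` frames in the call trace.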

#18 Updated by okurz 4 months ago

Have you overlooked #88439#note-14? The bug report https://bugzilla.suse.com/show_bug.cgi?id=1186610 is missing a lot of information, and as I stated: the call trace looks more like the system is just very busy, so that the postgres service startup eventually times out. I doubt the postgres package maintainer could do anything reasonable to help.

#19 Updated by openqa_review 3 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos@
https://openqa.opensuse.org/tests/1790899


#20 Updated by cdywan 3 months ago

There's an open question in the bug report:

Reinhard Max 2021-06-08 13:45:26 UTC
Any chance to log into a system that shows this behaviour for further debugging?

#21 Updated by cdywan 3 months ago

  • Due date set to 2021-07-09

Let's set a due date to make this more visible.

#22 Updated by ilausuch 2 months ago

  • Due date changed from 2021-07-09 to 2021-07-22

During a review of blocked tickets we moved the due date by a reasonable amount of time, to give osukup a chance, after his unexpected absence, to answer the question in Bugzilla and unblock the ticket.

#23 Updated by ilausuch about 2 months ago

osukup, could you please take care of this ticket? We haven't heard from you about it and we need some action from your side, or at least an answer on the Bugzilla ticket or an explanation of what the situation is in this case. It has been blocked for 2 months and we need to resolve this.

#24 Updated by osukup about 2 months ago

preparing SUT in orthos - ETA 26/27.7

#25 Updated by ilausuch about 2 months ago

  • Due date changed from 2021-07-22 to 2021-07-28

Thanks for the answer. Now that we have an estimate, we can move the due date a couple of days ahead.

#26 Updated by ilausuch about 2 months ago

osukup, please: it is now 1 day past your estimated date. Could you share the status?

#27 Updated by osukup about 2 months ago

Unfortunately the ticket is still without a response from infra: https://infra.nue.suse.com/SelfService/Display.html?id=192983

#28 Updated by okurz about 2 months ago

I can't see the ticket. What is it about? I guess you could try a systemd-nspawn container or also podman/docker with some hacks to have access to systemd.

#29 Updated by osukup about 2 months ago

okurz wrote:

I can't see the ticket. What is it about? I guess you could try a systemd-nspawn container or also podman/docker with some hacks to have access to systemd.

Activity in the ticket is ... zero.

All proposed solutions have a small problem: root access.

So if anybody has an aarch64 machine which can be reinstalled with TW, or an aarch64 VM, I will be very glad.

#30 Updated by cdywan about 2 months ago

osukup wrote:

All proposed solutions have a small problem: root access.

So if anybody has an aarch64 machine which can be reinstalled with TW, or an aarch64 VM, I will be very glad.

Where do you need root access? Are you talking about a worker? Staging machines?

#31 Updated by cdywan about 2 months ago

  • Due date deleted (2021-07-28)
  • Status changed from Blocked to Feedback

I'm setting this to Feedback since it's unclear to me what we're missing here or if this is blocking on something external or simply open questions.

#32 Updated by okurz about 2 months ago

  • Subject changed from [sporadic][tools] test fails in php7_postgresql to [sporadic][qe-core] test fails in php7_postgresql
  • Status changed from Feedback to New
  • Assignee changed from osukup to tjyrinki_suse
  • Target version deleted (Ready)

osukup, whatever you are waiting for, I consider it good courtesy to explain that in the bug report. Reinhard Max posted a question two months ago (!) which is still unanswered.

If you don't want to comment in the bug or are unable to do so, then please tell us and someone else can take over.

I still don't know what the EngInfra ticket is about. If you have problems with a single aarch64 machine, there are currently around 40 free ARM machines on https://orthos.arch.suse.de/index.php?&default_domain=6
However, I consider this approach of manual testing for something that is not clearly reproducible a bit too expensive. As this is already about an openQA test, I suggest using additional investigation code within the openQA test itself, e.g. in the post_fail_hook.
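
To make the suggestion about investigation code more concrete, here is a sketch of the kind of data such a hook might collect when the module fails. It is written in Python purely for illustration; the real hook would live in the Perl test module and use the test API instead. The selection of commands and the `collect_diagnostics` and `run_command` helpers are assumptions. The unit name and data directory are the ones that appear in this ticket.

```python
import subprocess

# Hypothetical selection of diagnostics; unit name and data directory
# are taken from the journal excerpt and the description of this ticket.
DIAGNOSTIC_COMMANDS = {
    "unit_status": ["systemctl", "status", "postgresql.service",
                    "--no-pager", "--full"],
    "unit_journal": ["journalctl", "-u", "postgresql.service",
                     "-b", "--no-pager"],
    "pg_ctl_status": ["pg_ctl", "-D", "/tmp/psql", "status"],
    "postgres_processes": ["ps", "-o", "pid,ppid,stat,wchan:32,args",
                           "-C", "postgres"],
}


def run_command(argv):
    """Run one command and return its exit code and output as text.

    Failures to run the command are reported in the result instead of
    raising, so that one missing tool does not hide the other results.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"could not run {argv[0]}: {exc}"
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"


def collect_diagnostics(commands=DIAGNOSTIC_COMMANDS, runner=run_command):
    """Run every diagnostic command and return a name-to-output mapping."""
    return {name: runner(argv) for name, argv in commands.items()}


# Example call on an affected system:
#   for name, output in collect_diagnostics().items():
#       print(f"== {name} ==\n{output}")
```

Collecting this on every failure would show, without manual access to an ARM machine, whether the server was still initializing, had hit the start timeout, or was in some other state.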

Regarding reproducibility I checked on o3 within the database for recent failures of "php7_postgresql":

openqa=> select job_id,t_created from job_modules where result = 'failed' and name = 'php7_postgresql' order by job_id desc limit 10;
 job_id  |         t_created          
---------+----------------------------
 1865822 | 2021-08-02 16:58:25.852556
 1865472 | 2021-08-02 03:53:58.415422
 1864591 | 2021-08-01 11:39:27.419158
 1861691 | 2021-07-30 00:02:20.121011
 1861212 | 2021-07-29 12:42:42.508814
 1859055 | 2021-07-27 13:53:12.735385
 1859054 | 2021-07-27 13:53:04.772199
 1858403 | 2021-07-27 01:55:17.292001
 1856659 | 2021-07-25 13:49:03.482586
 1855931 | 2021-07-25 12:00:40.839279
(10 rows)

Maybe they fail for different reasons. This would still need to be checked.
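
As a starting point for that check, the query result above can be summarised per day and turned into job links. This is only a sketch: the `failures_per_day` and `job_urls` helpers are hypothetical, and the URL pattern is the one used for the job links elsewhere in this ticket.

```python
from collections import Counter

# Rows copied from the query result above.
ROWS = """\
1865822 | 2021-08-02 16:58:25.852556
1865472 | 2021-08-02 03:53:58.415422
1864591 | 2021-08-01 11:39:27.419158
1861691 | 2021-07-30 00:02:20.121011
1861212 | 2021-07-29 12:42:42.508814
1859055 | 2021-07-27 13:53:12.735385
1859054 | 2021-07-27 13:53:04.772199
1858403 | 2021-07-27 01:55:17.292001
1856659 | 2021-07-25 13:49:03.482586
1855931 | 2021-07-25 12:00:40.839279
"""


def parse_rows(rows):
    """Yield (job_id, date) pairs from 'job_id | timestamp' lines."""
    for line in rows.splitlines():
        job_id, _, created = (part.strip() for part in line.partition("|"))
        if job_id.isdigit() and created:
            yield int(job_id), created.split()[0]


def failures_per_day(rows):
    """Count failed php7_postgresql modules per calendar day."""
    return Counter(day for _, day in parse_rows(rows))


def job_urls(rows):
    """Build the openQA job links to inspect one by one."""
    return [f"https://openqa.opensuse.org/tests/{job_id}"
            for job_id, _ in parse_rows(rows)]


for day, count in sorted(failures_per_day(ROWS).items()):
    print(day, count)
for url in job_urls(ROWS):
    print(url)
```

The per-day counts only show how the failures cluster. Whether they share the startup timeout from note 13 still has to be read from each job's logs.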

Overall it seems that, after three months, the option I stated in #88439#note-11 holds: we within the SUSE QE Tools team are unable to cover this ticket. Hence I am assigning this to the QE Core team with a plea for help to properly take over ownership of the test module and continue with the investigation.

#33 Updated by openqa_review 29 days ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos@
https://openqa.opensuse.org/tests/1879856

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The label in the openQA scenario is removed

#34 Updated by tjyrinki_suse 10 days ago

  • Status changed from New to Blocked
  • Assignee deleted (tjyrinki_suse)
  • Start date deleted (2021-02-03)

I don't want to ping-pong this, but investigating it seems hard. osukup has filed an infra ticket, presumably about a hardware problem; on the other hand, one could try to reproduce this to find out whether something else is happening. However, as far as the current O3 history goes, there have been no incidents of this since the one from February that has been preserved. There have also been no recent runs, as runs are blocked by other test problems, so the PHP test isn't run (it was last run, successfully, on 1st of July).
