action #58289

Huge amount of "Needle file .* not found where expected. Check /var/lib/openqa for distri symlinks" on o3 in /var/log/openqa

Added by okurz 4 months ago. Updated 4 months ago.

Status: Resolved
Start date: 17/10/2019
Priority: Normal
Due date:
Assignee: okurz
% Done: 0%
Category: Concrete Bugs
Target version: Current Sprint
Difficulty:
Duration:

Description

Observation

[2019-10-17T01:42:10.0358 UTC] [error] [pid:21333] Needle file /var/lib/openqa/share/42-installation_overview-Staging_Update-20190826.json not found where expected. Check /var/lib/openqa for distri symlinks
[2019-10-17T01:42:10.0364 UTC] [error] [pid:21333] Needle file /var/lib/openqa/share/inst-overview-gnome-leap-20161212.json not found where expected. Check /var/lib/openqa for distri symlinks
[2019-10-17T01:42:10.0371 UTC] [error] [pid:21333] Needle file /var/lib/openqa/share/inst-overview-gnome-leap-20180914.json not found where expected. Check /var/lib/openqa for distri symlinks
…

The needles are in /var/lib/openqa/share/tests/opensuse/products/opensuse/needles and should be looked up there instead of directly under "/var/lib/openqa/share/".

The first mention seems to be the following entry in /var/log/openqa.1.xz:

[2019-10-17T01:35:53.0168 UTC] [error] [pid:1946] Needle file /var/lib/openqa/share/grub2-TW-virtio-20190303.json not found where expected. Check /var/lib/openqa for distri symlinks

The webUI was restarted at 01:00 UTC, so the most likely trigger is the worker upgrade.

I received reports from openqa-logwarn trying to post 20MB emails to o3-admins@suse.de due to this. I am not sure if there are other impacts.
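To get an overview of how many of these errors a log contains without reading it line by line, something like the following could be used. This is a minimal sketch: `needle_errors` is a hypothetical helper name, and it assumes the log is uncompressed and uses the message format shown above.

```shell
# needle_errors LOGFILE
# Count "Needle file ... not found where expected" errors per needle path,
# most frequent first. (Hypothetical helper, not part of openQA.)
needle_errors() {
    grep -o 'Needle file [^ ]* not found where expected' "$1" \
        | awk '{ print $3 }' \
        | sort | uniq -c | sort -rn
}
```

Usage: `needle_errors /var/log/openqa | head`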

Problem

From /var/log/zypp/history:

2019-10-16 01:00:42|install|os-autoinst|4.5.1571127896.7bd3da32-lp151.193.1|x86_64||devel_openQA|1955dc706c08f41bdbe416d015a3913158b5109a5c0c65d079624f47ecbd6f4b|
2019-10-16 01:00:54|install|openQA-worker|4.6.1571122761.67cc75da9-lp151.1919.1|noarch||devel_openQA|3fb5de00ae76ea4012ae21b3f8951c1c15bdaab5059d3c2f0bfc8f01c7353065|
2019-10-17 01:50:51|install|os-autoinst|4.5.1571258068.dd114f84-lp151.195.1|x86_64||devel_openQA|ebb0d827dc41f42b7eff028260f164beaa09045f6e6175f01532867476facd54|
2019-10-17 01:51:00|install|openQA-worker|4.6.1571253176.1a322744e-lp151.1926.1|noarch||devel_openQA|1e3b706f34bfee6253d3cd399c00a20b1bf075f8a2945def74d53e87995f0170|
  • os-autoinst: 7bd3da32..dd114f84
$ git log1 --no-merges 7bd3da32..dd114f84
5df73dd6 (okurz/fix/typo) needle: Fix typo 'parrent'
9ce54ebe Use $needle::needles_dir in needle downloader of developer mode
df9256f6 Log data and pool dir when running fullstack test
aabfa1ab Allow loading needles from current working directory
e1fb6561 Improve error handling when parsing needle JSON
0e6da28e Extend architecture.md to cover needle handling
  • openQA: 67cc75da9..1a322744e
$ git log1 --no-merges 67cc75da9..1a322744e
916c45f5c PostgreSQL errors can be localized, so just use the name of the unique constraint
ce83ab943 (okurz/enhance/worker_reconnect) worker: Do not treat reconnect attempts as errors but with warning only
8811ad46c (Martchus/prevent-deadlocks) Remove wrong error handling code when sending ws messages
7506e0ae4 Prevent potential deadlocks in scheduler and ws server
81d318dd5 Hide old job templates editor for new groups
3172996fd (kraih/screenshots_resultset) Handle unique constraint correctly
51967db7f Add missing resultset for screenshots and make a few small optimizations
94afcda64 Drop -v flag on test runs and avoid noisy job "name"
e7c3f3cff clone job: Support specifying a port in host URL

I suspect os-autoinst aabfa1ab but it could be openQA 3172996fd as well.
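The commit ranges above correspond to the last dot-separated component of the package versions in /var/log/zypp/history. A minimal sketch for extracting them, assuming that version layout holds for every entry; `pkg_commits` is a hypothetical helper name:

```shell
# pkg_commits HISTORY_FILE PACKAGE
# Print "<date> <time> <commit>" for every install entry of PACKAGE.
# Assumes the git short hash is the last dot-separated part of the
# version, before the "-<release>" suffix.
pkg_commits() {
    awk -F'|' -v pkg="$2" '
        $2 == "install" && $3 == pkg {
            split($4, v, "-")            # v[1] = version without release
            n = split(v[1], p, "[.]")    # p[n] = last version component
            print $1, p[n]
        }
    ' "$1"
}
```

Usage: `pkg_commits /var/log/zypp/history os-autoinst | tail -n 2` gives the two hashes to pass to git log.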

I rolled back the workers with:

for i in aarch64 openqaworker1 openqaworker4 imagetester; do echo $i && ssh root@$i "transactional-update rollback last && reboot"; done
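After such a rollback it can be worth confirming that every worker really ended up on the old os-autoinst build once it has rebooted. The following is a minimal sketch, not what was run here: `check_rollback` and the `REMOTE` override are hypothetical, and it assumes `rpm -q os-autoinst` on the worker prints a package name containing the commit hash.

```shell
# check_rollback EXPECTED_COMMIT HOST...
# Print one line per host and return non-zero if any host does not
# report an os-autoinst package containing EXPECTED_COMMIT.
# REMOTE is a hypothetical hook to replace ssh, e.g. for a dry run.
check_rollback() {
    expected=$1
    shift
    rc=0
    for host in "$@"; do
        actual=$(${REMOTE:-ssh} "root@$host" rpm -q os-autoinst)
        case $actual in
            *"$expected"*) echo "$host ok: $actual" ;;
            *)             echo "$host MISMATCH: $actual"; rc=1 ;;
        esac
    done
    return $rc
}
```

Usage: `check_rollback 7bd3da32 aarch64 openqaworker1 openqaworker4 imagetester`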


Related issues

Related to openQA Project - action #56789: New needles from git repository not working with openqa-c... New 11/09/2019

History

#1 Updated by okurz 4 months ago

  • Target version set to Current Sprint

#2 Updated by okurz 4 months ago

sudo tail -f /var/log/openqa | grep 'not found where expected' still shows a lot of errors after the worker rollback. So maybe the webUI upgrade is the cause then?

zypper in --oldpackage /var/cache/zypp/packages/devel_openQA/noarch/openQA{,-common,-client}-4.6.1571122761.67cc75da9-lp151.1919.1.noarch.rpm
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following 3 packages are going to be downgraded:
  openQA openQA-client openQA-common

3 packages to downgrade.
Overall download size: 2.4 MiB. Already cached: 0 B. After the operation, 1.1 KiB will be freed.
Continue? [y/n/v/...? shows all options] (y): d

The following 3 packages are going to be downgraded:
  openQA       
    4.6.1571253176.1a322744e-lp151.1926.1 -> 4.6.1571122761.67cc75da9-lp151.1919.1  noarch
    Plain RPM files cache  obs://build.opensuse.org/devel:openQA
  openQA-client
    4.6.1571253176.1a322744e-lp151.1926.1 -> 4.6.1571122761.67cc75da9-lp151.1919.1  noarch
    Plain RPM files cache  obs://build.opensuse.org/devel:openQA
  openQA-common
    4.6.1571253176.1a322744e-lp151.1926.1 -> 4.6.1571122761.67cc75da9-lp151.1919.1  noarch
    Plain RPM files cache  obs://build.opensuse.org/devel:openQA

… still happening
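To tell whether errors are still being logged after a given point in time, such as a rollback or downgrade, the timestamps at the start of each log line can be compared. A minimal sketch, assuming the bracketed timestamp format shown in the log excerpts of this ticket; `errors_since` is a hypothetical helper name:

```shell
# errors_since LOGFILE TIMESTAMP
# Count "not found where expected" errors logged at or after TIMESTAMP
# (same format as in the log, e.g. 2019-10-17T02:00:00).
errors_since() {
    awk -v since="$2" '
        /not found where expected/ {
            ts = substr($1, 2)        # drop the leading "["
            if (ts >= since) n++      # timestamps compare correctly as strings
        }
        END { print n + 0 }
    ' "$1"
}
```

Usage, with a placeholder time: `errors_since /var/log/openqa 2019-10-17T09:00:00`. A result of 0 would mean that no new errors were logged since then.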

#3 Updated by okurz 4 months ago

  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

#4 Updated by Guillaume_G 4 months ago

This may be related: we cannot access any needle from the openQA web interface. Click on the "Screenshot" list on https://openqa.opensuse.org/tests/1059301#step/setup_zdup/4 and no image is listed.

#5 Updated by okurz 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

@Guillaume_G yes, I think that's the same problem.

I think we found it. Thanks to mkittler and nsinger for the quick mob debug session. This is fun :)

It was the commit I suspected; the rollback on aarch64 was just incomplete. See https://github.com/os-autoinst/os-autoinst/pull/1233 for the revert fix.

The faulty commit changed the needle path that is sent from the worker to the webUI so that it only contains the file name. As a result, the webUI cannot reference the screenshots correctly anymore.
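The effect can be illustrated with a minimal sketch. It assumes, as a simplification that is not taken from the openQA code, that the webUI joins its share directory with whatever path the worker reports; `resolve_needle` is a hypothetical helper.

```shell
# resolve_needle SHARE_DIR REPORTED_PATH
# Hypothetical model of the lookup: join the share directory with the
# path reported by the worker.
resolve_needle() {
    printf '%s/%s\n' "$1" "$2"
}

share=/var/lib/openqa/share
needles=tests/opensuse/products/opensuse/needles

# Path relative to the share directory: points into the needles directory.
resolve_needle "$share" "$needles/inst-overview-gnome-leap-20161212.json"

# File name only: yields the non-existent path seen in the error log.
resolve_needle "$share" "inst-overview-gnome-leap-20161212.json"
```

The second call prints /var/lib/openqa/share/inst-overview-gnome-leap-20161212.json, which matches the path in the error messages from the observation.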

#6 Updated by okurz 4 months ago

  • Related to action #56789: New needles from git repository not working with openqa-clone-custom-git-refspec added

#7 Updated by okurz 4 months ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal

With the revert merged, the next nightly update should also be fine. The workers are currently rolled back to the old version, so they are ok as well. Urgency removed. I can check the state of the webUI and the workers again tomorrow and leave the rest of the work to #56789.

#8 Updated by okurz 4 months ago

  • Status changed from Feedback to Resolved

The workers do not exactly look ok, see #58403, but the rest is fine.
