action #160628

closed

coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

periodic multi-machine OSD test in https://gitlab.suse.de/openqa/scripts-ci/ does not trigger any jobs size:S

Added by okurz 7 months ago. Updated 5 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-01-30
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2635998 is a passed job but apparently no jobs were ever triggered:

++ echo '$ bash -e < <(curl -s ${GITHUB_REPO_URL//github.com/raw.githubusercontent.com}/master/$OS_AUTOINST_SCRIPT)'
++ bash -e
+++ curl -s https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-schedule-mm-ping-test
+ openqa_url=http://openqa.suse.de
+ distri=sle
+ flavor=Server-DVD-Updates
+ arch=x86_64
+ version=15-SP5
+ test_name=ovs-client
++ mktemp
+ tmpfile=/tmp/tmp.QFrAbZzq4U
+ trap 'rm -f "$tmpfile"' EXIT
+ cat
++ openqa-cli api --host http://openqa.suse.de jobs version=15-SP5 scope=relevant arch=x86_64 flavor=Server-DVD-Updates test=ovs-client latest=1
++ jq -r '.jobs | map(select(.result == "passed")) | max_by(.settings.BUILD) .settings.HDD_1'
+ hdd=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240519-1-Server-DVD-Updates-64bit.qcow2
+ rm -f /tmp/tmp.QFrAbZzq4U
Cleaning up project directory and file based variables 00:00
Job succeeded

Executing locally looks correct:

$ test_name=ovs-client flavor=Server-DVD-Updates version=15-SP5 distri=sle openqa_url=http://openqa.suse.de ./openqa-schedule-mm-ping-test
+(./openqa-schedule-mm-ping-test:4): main(): openqa_url=http://openqa.suse.de
+(./openqa-schedule-mm-ping-test:5): main(): distri=sle
+(./openqa-schedule-mm-ping-test:6): main(): flavor=Server-DVD-Updates
+(./openqa-schedule-mm-ping-test:7): main(): arch=x86_64
+(./openqa-schedule-mm-ping-test:8): main(): version=15-SP5
+(./openqa-schedule-mm-ping-test:9): main(): test_name=ovs-client
++(./openqa-schedule-mm-ping-test:11): main(): mktemp
+(./openqa-schedule-mm-ping-test:11): main(): tmpfile=/tmp/tmp.TfQJAGLQR9
+(./openqa-schedule-mm-ping-test:12): main(): trap 'rm -f "$tmpfile"' EXIT
+(./openqa-schedule-mm-ping-test:14): main(): cat
++(./openqa-schedule-mm-ping-test:53): main(): openqa-cli api --host http://openqa.suse.de jobs version=15-SP5 scope=relevant arch=x86_64 flavor=Server-DVD-Updates test=ovs-client latest=1
++(./openqa-schedule-mm-ping-test:53): main(): jq -r '.jobs | map(select(.result == "passed")) | max_by(.settings.BUILD) .settings.HDD_1'
+(./openqa-schedule-mm-ping-test:53): main(): hdd=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240519-1-Server-DVD-Updates-64bit.qcow2
++(./openqa-schedule-mm-ping-test:54): main(): date -Im
+(./openqa-schedule-mm-ping-test:54): main(): openqa-cli schedule --monitor --host http://openqa.suse.de --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.TfQJAGLQR9 DISTRI=sle VERSION=15-SP5 FLAVOR=Server-DVD-Updates ARCH=x86_64 BUILD=2024-05-21T08:37+02:00 _GROUP_ID=0 HDD_1=SLES-15-SP5-x86_64-mru-install-minimal-with-addons-Build20240519-1-Server-DVD-Updates-64bit.qcow2
{"count":2,"failed":[],"ids":[14384617,14384618],"scheduled_product_id":2126387}
2 jobs have been created:
 - http://openqa.suse.de/tests/14384617
 - http://openqa.suse.de/tests/14384618
{"blocked_by_id":null,"id":14384617,"result":"none","state":"scheduled"}
Job state of job ID 14384617: scheduled, waiting … (delay: 10; waited 0s)
{"blocked_by_id":null,"id":14384617,"result":"none","state":"running"}
…
Job state of job ID 14384617: running, waiting … (delay: 10; waited 100s)
{"blocked_by_id":null,"id":14384617,"result":"passed","state":"done"}
{"blocked_by_id":null,"id":14384618,"result":"passed","state":"done"}

The problem reproduces consistently within GitLab CI.

Steps to reproduce

Suggestions

  • Try to reproduce the problem within the container environment registry.opensuse.org/devel/openqa/ca/containers/os-autoinst-scripts. Maybe the "time" prefix in the call "time openqa-cli schedule" is the problem?

Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure - action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M (Resolved, jbaier_cz, 2024-01-30)

Copied to openQA Project - action #160820: openqa-cli: Do not read from STDIN unless explicitly requested size:S (Resolved, mkittler)

Actions #1

Updated by okurz 7 months ago

  • Copied from action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added
Actions #2

Updated by okurz 7 months ago

  • Description updated (diff)
Actions #3

Updated by livdywan 7 months ago

  • Subject changed from periodic multi-machine OSD test in https://gitlab.suse.de/openqa/scripts-ci/ does not trigger any jobs to periodic multi-machine OSD test in https://gitlab.suse.de/openqa/scripts-ci/ does not trigger any jobs size:S
  • Status changed from New to Workable
Actions #4

Updated by mkittler 7 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #5

Updated by mkittler 7 months ago · Edited

The o3-related jobs are equally affected, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2636833.

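The problem is also not reproducible in the container (e.g. podman container run -v$PWD:/opt --rm -it registry.opensuse.org/devel/openqa/ca/containers/os-autoinst-scripts bash). There the script reaches the scheduling step and only fails because no API key is configured: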

…
+ cat
++ openqa-cli api --host https://openqa.opensuse.org jobs version=Tumbleweed scope=relevant arch=x86_64 flavor=DVD test=ping_client latest=1
++ jq -r '.jobs | map(select(.result == "passed")) | max_by(.settings.BUILD) .settings.HDD_1'
+ hdd=opensuse-Tumbleweed-x86_64-20240226-textmode@64bit.qcow2
++ date -Im
+ openqa-cli schedule --monitor --host https://openqa.opensuse.org --param-file SCENARIO_DEFINITIONS_YAML=/tmp/tmp.jShRofY0CL DISTRI=opensuse VERSION=Tumbleweed FLAVOR=DVD ARCH=x86_64 BUILD=2024-05-21T15:45+00:00 _GROUP_ID=0 HDD_1=opensuse-Tumbleweed-x86_64-20240226-textmode@64bit.qcow2
403 Forbidden
{"error":"no api key","error_status":403}

real    0m0.487s
user    0m0.321s
sys     0m0.043s
+ rm -f /tmp/tmp.jShRofY0CL
Actions #6

Updated by openqa_review 7 months ago

  • Due date set to 2024-06-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by mkittler 6 months ago

I still have no idea. Even this version stops early (after the assignment of the json variable; the assigned JSON looks good). So it basically already fails after the first openqa-cli invocation and removing pipefail doesn't change anything.

Actions #8

Updated by tinita 6 months ago · Edited

I was curious and had a short look.
I was able to create a small reproducer, but I'm not sure yet what the exact problem is, only that it depends on bash < <(curl script) vs. ./script, and that it only happens if openqa-cli is called. Doing cat some.json | jq ... is fine.

 # cat script.sh 
#!/bin/bash
set -eux -o pipefail

hdd=$(openqa-cli api --o3 jobs/4214329)
echo "HDD: $hdd"

echo "-------------- END -----------------"

# bash < <(cat script.sh)
++ openqa-cli api --o3 jobs/4214329
+ hdd='{"job":{"assets":...}}'

# ./script.sh 
++ openqa-cli api --o3 jobs/4214329
+ hdd='{"job":{...}}'
+ echo 'HDD: {"job":{"assets":{...}}'
HDD: {"job":{"assets":{...}}
+ echo '-------------- END -----------------'
-------------- END -----------------

edit: the failure is the same for:

# bash < script.sh
Actions #9

Updated by tinita 6 months ago · Edited

Aha! perl is reading STDIN, and with bash < ... bash and every command it starts share the same STDIN. So after the call to openqa-cli STDIN is drained, and bash has no more lines to execute.
Reproduced with this:

% cat script.sh 
#!/bin/bash
perl openqa-cli
echo "-------------- END -----------------"

% cat openqa-cli 
#!/usr/bin/env perl
use v5.10;
say '{"some":"json"}';

sub data_from_stdin { # from OpenQA/Command.pm
    vec(my $r = '', fileno(STDIN), 1) = 1;
    return !-t STDIN && select($r, undef, undef, 0) ? join '', <STDIN> : '';
}
my $test = data_from_stdin();
say "data_from_stdin: '$test'";

% ./script.sh     
{"some":"json"}
data_from_stdin: ''
-------------- END -----------------

% bash < script.sh
{"some":"json"}
data_from_stdin: 'echo "-------------- END -----------------"
'
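
The same effect can be shown without perl or openqa-cli. This is a minimal sketch, assuming `cat` as a stand-in for any command that drains STDIN; the function names `run_via_stdin` and `run_via_file` are hypothetical and only serve the illustration:

```shell
#!/bin/bash
# Two-line script: the first command drains STDIN (stand-in for openqa-cli),
# the second line prints a marker.
demo_script=$'cat >/dev/null\necho END'

# Feed the script to bash over STDIN: "cat" inherits that STDIN and
# swallows the remaining line, so bash finds nothing left to run.
run_via_stdin() {
    printf '%s\n' "$demo_script" | bash
}

# Run the same script from a file: bash reads the script text from the file,
# so STDIN (here /dev/null) is unrelated to the script.
run_via_file() {
    local f
    f=$(mktemp)
    printf '%s\n' "$demo_script" > "$f"
    bash "$f" < /dev/null
    rm -f "$f"
}

echo "via stdin: '$(run_via_stdin)'"
echo "via file:  '$(run_via_file)'"
```

Only the file-based run should print the END marker; the STDIN-based run ends silently after the first command, just like the CI job.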

So the fix is to just download the script to a temp file and then execute it :)
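
A rough sketch of that approach, not necessarily what the eventual merge request does: the helper name `run_from_tempfile` is hypothetical, and it takes the download command as its arguments so the script text never travels over bash's STDIN.

```shell
#!/bin/bash
# Hypothetical helper: write the output of a command (e.g. curl) to a
# temporary file and execute that file instead of piping it into bash.
run_from_tempfile() {
    local script rc=0
    script=$(mktemp)
    # "$@" is the command that prints the script to stdout.
    if "$@" > "$script"; then
        # bash reads the script from the file; commands inside it may
        # read STDIN without consuming the script itself.
        bash -e "$script" || rc=$?
    else
        rc=$?
    fi
    rm -f "$script"
    return "$rc"
}

# Example call (URL taken from the CI log above):
# run_from_tempfile curl -s https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-schedule-mm-ping-test
```

The helper returns the exit status of the download command if that fails, otherwise the exit status of the executed script, so a CI job using it would still fail when the script fails.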

Actions #10

Updated by mkittler 6 months ago

Thank you, I'll try that.

Actions #11

Updated by mkittler 6 months ago

  • Status changed from In Progress to Feedback

I created and merged https://gitlab.suse.de/openqa/scripts-ci/-/merge_requests/6. It works, e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/2647057 now enters the scheduling/monitoring step. I also re-triggered the pipeline for OSD.

Actions #12

Updated by mkittler 6 months ago

  • Status changed from Feedback to Resolved

The jobs are now passing correctly.

Actions #13

Updated by tinita 6 months ago

Just for some context: The bug started to appear when I made this change:
https://github.com/os-autoinst/scripts/commit/5e1d67a27a68600f8bdfca92bd151d32fc657226
Before that change both openqa-cli commands were in one line (the last), so emptying STDIN wasn't doing any harm.
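
Why the one-line form was harmless can be sketched as follows. This assumes, for illustration only, that the old script nested the first call inside the last line via command substitution; `cat` stands in for a command that drains STDIN, and `one_line`, `split_lines` and `disk.qcow2` are made-up names:

```shell
#!/bin/bash
# One-line form: the STDIN-draining command runs inside the last line,
# so there is no later script text it could swallow.
one_line() {
    printf '%s\n' 'echo "schedule HDD_1=$(cat >/dev/null; echo disk.qcow2)"' | bash
}

# Split form: the STDIN-draining command runs on its own line and
# swallows the following line before bash gets to read it.
split_lines() {
    printf '%s\n' \
        'hdd=$(cat >/dev/null; echo disk.qcow2)' \
        'echo "schedule HDD_1=$hdd"' | bash
}

echo "one line: '$(one_line)'"
echo "split:    '$(split_lines)'"
```

In the split form the assignment still succeeds, which matches the observation that the script stopped right after the variable was set.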

Actions #14

Updated by tinita 6 months ago

We might actually want to change openqa-cli to not read from STDIN unless explicitly requested.
It's something I wouldn't have expected to happen here.
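
Independent of such a change in openqa-cli, a caller can guard against this by giving the command an empty STDIN. This is a general shell technique and not something proposed in this ticket; the sketch again uses `cat` as a stand-in for a STDIN-reading command, and `guarded` and `disk.qcow2` are hypothetical names:

```shell
#!/bin/bash
# The STDIN-reading command gets /dev/null as its STDIN, so it cannot
# consume the rest of a script that bash itself is reading from STDIN.
guarded() {
    printf '%s\n' \
        'hdd=$(cat </dev/null; echo disk.qcow2)' \
        'echo "schedule HDD_1=$hdd"' | bash
}

echo "guarded: '$(guarded)'"
```

With the redirection in place, the second line of the piped script is still executed.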

Actions #15

Updated by tinita 6 months ago

  • Copied to action #160820: openqa-cli: Do not read from STDIN unless explicitly requested size:S added
Actions #16

Updated by okurz 5 months ago

  • Due date deleted (2024-06-05)