action #70774

open

coordination #80142: [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert

save_needle Minion tasks fail frequently and needles could get lost

Added by mkittler about 4 years ago. Updated 5 months ago.

Status: New
Priority: Low
Assignee: -
Category: Feature requests
Target version:
Start date: 2020-09-01
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

The save_needle Minion task fails frequently on OSD and also sometimes on o3.

This can be observed using the following query parameters: https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=save_needle
I'm going to remove most of these jobs to calm down the alert, but right now 24 jobs have piled up over 2 months. The problem has actually existed for longer than 2 months; the failures have just been cleaned up manually so far.

The problem is always that the Git working tree is in a state which the task cannot handle:

1.

  "result" => {
    "error" => "<strong>Failed to save addon_products-module-dev-tools-pvm-20200805.</strong><br><pre>Unable to commit via Git: On branch master\nYour branch is up to date with 'origin/master'.\n\nnothing to commit, working tree clean</pre>"
  },

2.

  "result" => {
    "error" => "<strong>Failed to save manually_add_profile-AppArmor-Chose-a-program-to-generate-a-profile-20200827.</strong><br><pre>Unable to reset repository to origin/master: error: cannot rebase: Your index contains uncommitted changes.\nerror: Please commit or stash them.</pre>"
  },
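To tell the two failure modes apart when going through the failed jobs, the error strings can be matched against the Git messages quoted above. This is only a minimal sketch: the function name and the category labels are made up for illustration, and the patterns are limited to the two messages shown in this ticket.

```python
import re

# Hypothetical category labels; the patterns are copied from the two
# error messages quoted in this ticket and nothing else.
PATTERNS = [
    ("nothing_to_commit", re.compile(r"nothing to commit, working tree clean")),
    ("uncommitted_changes", re.compile(r"Your index contains uncommitted changes")),
]


def classify_failure(error: str) -> str:
    """Return a coarse category for a save_needle error string."""
    for name, pattern in PATTERNS:
        if pattern.search(error):
            return name
    # Anything else (e.g. other Git errors) is left unclassified.
    return "unknown"
```

Applied to the `error` field of each failed job, this gives a rough count of how often each situation occurs.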

Acceptance criteria

  • AC1: The save_needle task can handle problematic situations mentioned below.

Suggestions

It would be useful if the task could handle the problematic situations itself instead of requiring manual intervention. Note that the delete_needle task (which shares the same Git code) is also affected. We likely see fewer problems there because that task is executed less often.

Problematic situations

  1. No diff has been produced which could be committed: This probably happens when there is no actual change, in which case we can simply return early.
  2. The Git directory contains uncommitted changes: We could save these changes on a new branch before rebasing.
  3. We cannot push the new commit because new commits have been pushed to the remote from elsewhere in the meantime: Just repeat the procedure.
  4. The fetch needles script is interfering.
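
Situations 1 and 3 could be handled roughly as in the sketch below. It is not the existing task code (which is part of openQA and written in Perl). The names `make_runner` and `commit_and_push` are hypothetical, the remote `origin` and branch `master` are taken from the error messages above, and rebasing the local commit before retrying the push is an assumption standing in for "repeat the procedure". Situations 2 and 4 are not covered.

```python
import subprocess
from typing import Callable, Sequence

GitRunner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def make_runner(repo: str) -> GitRunner:
    """Return a function that runs `git -C <repo> <args...>` and captures output."""
    def run(args: Sequence[str]) -> subprocess.CompletedProcess:
        return subprocess.run(["git", "-C", repo, *args],
                              capture_output=True, text=True)
    return run


def commit_and_push(git: GitRunner, paths: Sequence[str], message: str,
                    max_attempts: int = 3) -> str:
    """Commit the given paths and push, tolerating situations 1 and 3."""
    git(["add", "--", *paths])
    # Situation 1: `git diff --cached --quiet` exits with 0 when nothing is
    # staged, so there is nothing to commit and we can return early.
    if git(["diff", "--cached", "--quiet"]).returncode == 0:
        return "unchanged"
    if git(["commit", "-m", message]).returncode != 0:
        raise RuntimeError("commit failed")
    for _ in range(max_attempts):
        if git(["push", "origin", "master"]).returncode == 0:
            return "pushed"
        # Situation 3 (assumed cause): the remote has moved on. Fetch,
        # rebase the new commit on top of it and try the push again.
        git(["fetch", "origin"])
        if git(["rebase", "origin/master"]).returncode != 0:
            git(["rebase", "--abort"])
            raise RuntimeError("rebase failed")
    raise RuntimeError("push failed after %d attempts" % max_attempts)
```

Passing the Git runner in as a parameter keeps the control flow testable without a real repository; for real use, `commit_and_push(make_runner(needledir), [...], "...")` would operate on the needles checkout.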

Related issues 4 (0 open, 4 closed)

Related to openQA Project - coordination #33745: [epic] Improve handling of external Git repositories (for needles) (Resolved, mkittler, 2024-06-20)
Related to openQA Infrastructure - action #61221: osd: unable to save needles, minion fails with "fatal: Unable to create '/var/lib/openqa/.../needles/.git/index.lock'" (Resolved, okurz, 2019-12-20)
Related to openQA Infrastructure - action #98499: [alert] web UI: Too many Minion job failures alert size:S (Resolved, mkittler, 2021-09-13)
Has duplicate openQA Project - action #75070: save_needle minion task fails because "Your branch is ahead of 'origin/master'" (Rejected, 2020-10-22)

Actions #1

Updated by mkittler about 4 years ago

  • Related to coordination #33745: [epic] Improve handling of external Git repositories (for needles) added
Actions #2

Updated by okurz about 4 years ago

  • Target version set to Ready
Actions #3

Updated by mkittler about 4 years ago

  • Tags set to alert
Actions #4

Updated by okurz about 4 years ago

  • Description updated (diff)
  • Category set to Feature requests

I think you once already proposed a solution to "repair" the git state. Can you reference that again so that we can think about what we can take from that?

Actions #5

Updated by okurz about 4 years ago

  • Related to action #61221: osd: unable to save needles, minion fails with "fatal: Unable to create '/var/lib/openqa/.../needles/.git/index.lock'" added
Actions #6

Updated by okurz about 4 years ago

  • Target version changed from Ready to future
Actions #7

Updated by Xiaojing_liu about 4 years ago

  • Has duplicate action #75070: save_needle minion task fails because "Your branch is ahead of 'origin/master'" added
Actions #8

Updated by mkittler about 3 years ago

  • Related to coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert added
Actions #9

Updated by okurz about 3 years ago

  • Related to action #98499: [alert] web UI: Too many Minion job failures alert size:S added
Actions #10

Updated by okurz about 3 years ago

  • Parent task set to #96263
Actions #11

Updated by livdywan almost 3 years ago

From yesterday:

---
args:
- commit_message: ''
  imagedir: ''
  imagedistri: ''
  imagename: hexchat-23.png
  imageversion: ''
  job_id: 7760510
  needle_json: "{\r\n  \"area\": [\r\n    {\r\n      \"ypos\": 181,\r\n      \"type\":
    \"match\",\r\n      \"xpos\": 335,\r\n      \"click_point\": {\r\n        \"ypos\":
    18,\r\n        \"xpos\": 119\r\n      },\r\n      \"height\": 34,\r\n      \"width\":
    180\r\n    }\r\n  ],\r\n  \"properties\": [],\r\n  \"tags\": [\r\n    \"hexchat-nick-bernhard\"\r\n
    \ ]\r\n}"
  needledir: /var/lib/openqa/share/tests/sle/products/sle/needles
  needlename: hexchat-nick-bernhard--20211130
  overwrite: '1'
  user_id: 175
attempts: 1
children: []
created: 2021-11-30T11:20:59.50934Z
delayed: 2021-11-30T11:20:59.50934Z
expires: 2021-11-30T11:21:59.50934Z
finished: 2021-11-30T11:21:01.01916Z
id: 3536401
lax: 0
notes:
  gru_id: 30650350
parents: []
priority: 20
queue: default
result:
  error: "<strong>Failed to save hexchat-nick-bernhard--20211130.</strong><br><pre>Unable
    to commit via Git: On branch master\nYour branch is up to date with 'origin/master'.\n\nUntracked
    files:\n  (use \"git add <file>...\" to include in what will be committed)\n\tnautilus-1-20211102.json\n\tnautilus-1-20211102.png\n\tseahorse_sshkey-seahorse-display-sshkey-20211022.json\n\tseahorse_sshkey-seahorse-display-sshkey-20211022.png\n\nnothing
    added to commit but untracked files present (use \"git add\" to track)</pre>"
retried: ~
retries: 0
started: 2021-11-30T11:20:59.51323Z
state: failed
task: save_needle
time: 2021-12-01T10:16:52.2364Z
worker: 575

The error is this one:

Failed to save hexchat-nick-bernhard--20211130.

Unable to commit via Git: On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    nautilus-1-20211102.json
    nautilus-1-20211102.png
    seahorse_sshkey-seahorse-display-sshkey-20211022.json
    seahorse_sshkey-seahorse-display-sshkey-20211022.png

nothing added to commit but untracked files present (use "git add" to track)
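
For cases like this one, it might help to list the leftover untracked needle files explicitly before attempting the commit. The sketch below parses the output of `git status --porcelain` (version 1 format), where untracked entries are prefixed with `??`. The function name is hypothetical, and paths that Git quotes because of special characters are not handled.

```python
from typing import List


def untracked_paths(porcelain: str) -> List[str]:
    """Return the paths marked '??' in `git status --porcelain` output."""
    # Each porcelain v1 line is a two-character status, a space, then the path.
    return [line[3:] for line in porcelain.splitlines()
            if line.startswith("?? ")]
```

Whether such leftovers should then be committed, moved aside or just logged is a separate decision that this sketch does not make.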
Actions #12

Updated by livdywan over 2 years ago

We have some very similar looking new cases:

<strong>Failed to save foreground-winget-install-20220621.</strong><br><pre>Unable
to commit via Git: On branch master\nYour branch is up to date with 'origin/master'.\n\nUntracked
files:\n  (use \"git add <file>...\" to include in what will be committed)\n\tfirefox-private-facebook-20220412.json\n\tfirefox-private-facebook-20220412.png\n\tnautilus-1-20211102.json\n\tnautilus-1-20211102.png\n\tseahorse_sshkey-seahorse-display-sshkey-20211022.json\n\tseahorse_sshkey-seahorse-display-sshkey-20211022.png\n\tsystem-indicator-20220511.json\n\tsystem-indicator-20220511.png\n\tyast2_lan_hostname_tab-20220615.json\n\tyast2_lan_hostname_tab-20220615.png\n\nnothing
added to commit but untracked files present (use \"git add\" to track)\n</pre>

And another one with a different error:

<strong>Failed to save foreground-winget-install-20220621.</strong><br><pre>Unable to commit via Git: fatal: Unable to create '/var/lib/openqa/share/tests/sle/products/sle/needles/.git/index.lock': File exists.

    Another git process seems to be running in this repository, e.g.
    an editor opened by 'git commit'. Please make sure all processes
    are terminated then try again. If it still fails, a git process
    may have crashed in this repository earlier:
    remove the file manually to continue.</pre>
Actions #13

Updated by mkittler about 2 years ago

The last error you've mentioned is fixed by https://github.com/os-autoinst/openQA/pull/4835.

Actions #14

Updated by livdywan over 1 year ago

Apparently it's coming back despite our best efforts to ignore it ;-) Of course the impact is limited, and it's not preventing people from being able to save needles.

---
args:
- commit_message: ''
  imagedir: ''
  imagedistri: ''
  imagename: addon_products_sle-102.png
  imageversion: ''
  job_id: 11658202
  needle_json: "{\r\n  \"area\": [\r\n    {\r\n      \"xpos\": 25,\r\n      \"ypos\":
    271,\r\n      \"width\": 229,\r\n      \"height\": 20,\r\n      \"type\": \"match\"\r\n
    \   }\r\n  ],\r\n  \"properties\": [],\r\n  \"tags\": [\r\n    \"addon-products-all_packages-sdk-highlighted\",\r\n
    \   \"addon-products-all_packages-sdk-selected\"\r\n  ]\r\n}"
  needledir: /var/lib/openqa/share/tests/sle/products/sle/needles
  needlename: addon_products_sle-addon-products-all_packages-sdk-highlighted-20230726
  overwrite: ~
  user_id: 160
attempts: 1
children: []
created: 2023-07-26T03:08:42.911954Z
delayed: 2023-07-26T03:08:42.911954Z
expires: 2023-07-26T03:09:42.911954Z
finished: 2023-07-26T03:09:28.780636Z
id: 7876467
lax: 0
notes:
  gru_id: 34515825
parents: []
priority: 20
queue: default
result:
  error: |-
    <strong>Failed to save addon_products_sle-addon-products-all_packages-sdk-highlighted-20230726.</strong><br><pre>Unable to commit via Git: On branch master
    Your branch is up to date with 'origin/master'.

    nothing to commit, working tree clean
    </pre>
retried: ~
retries: 0
started: 2023-07-26T03:08:42.996281Z
state: failed
task: save_needle
time: 2023-07-26T11:03:53.714482Z
worker: 1429

We are seeing quite a lot at the moment, though:

7876467 save_needle default 8 hours ago failed a minute
7876054 save_needle default 9 hours ago failed a few seconds
7871427 save_needle default 20 hours ago failed a few seconds
7869487 save_needle default a day ago failed a few seconds
7869480 save_needle default a day ago failed a few seconds
7867900 save_needle default a day ago failed a few seconds
7866854 save_needle default a day ago failed a few seconds
7865685 save_needle default a day ago failed a few seconds
7864283 save_needle default a day ago failed a few seconds
7864152 save_needle default a day ago failed a few seconds

Actions #15

Updated by okurz over 1 year ago

I suspect that "Failed to save … nothing to commit, working tree clean" means that another save_needle job running in parallel has already handled that needle. Or the fetchneedles script, which runs every minute, is "handling" such cases, maybe by pruning uncommitted files from Git repos.
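
If that suspicion holds, one conceivable mitigation is to serialize everything that touches the needles checkout with an advisory file lock. The sketch below uses `flock` via Python's `fcntl` module (Unix only). The helper name `repo_lock` and the lock file location are hypothetical, and it would only help if every writer, including the fetch script, took the same lock. Whether openQA already serializes these jobs in some other way is not established in this ticket.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def repo_lock(lock_path: str, blocking: bool = True):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
        # With LOCK_NB this raises BlockingIOError if the lock is already held.
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A task would wrap its whole sequence of Git operations in `with repo_lock(path):`, so that a second job or a fetch run waits instead of changing the working tree underneath it.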

Actions #16

Updated by mkittler 5 months ago

  • Subject changed from save_needle Minion tasks fail frequently to save_needle Minion tasks fail frequently and needles could get lost
  • Description updated (diff)