action #123499

closed

coordination #120660: [epic] Clean up old kubernetes jobs automatically

PCW: Cleanup old jobs in google kubernetes

Added by ilausuch over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
2023-01-23
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Clean the jobs in GKE

Acceptance Criteria

  • Use a kubeconf in a file
  • Delete jobs that have more than 1 day of age
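The two criteria above could be sketched with the kubernetes Python library roughly as follows. This is a minimal sketch under stated assumptions, not the PCW implementation: the function names, the `dry_run` flag and the `Background` propagation policy are illustrative choices, and age is measured from each job's creation timestamp.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)  # acceptance criterion: more than 1 day of age


def is_older_than(created, max_age=MAX_AGE, now=None):
    """Return True if a job created at `created` (timezone-aware) exceeds max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created > max_age


def cleanup_old_jobs(kubeconfig_path, max_age=MAX_AGE, dry_run=True):
    """List jobs in all namespaces and delete those older than max_age.

    Hypothetical helper; needs the third-party `kubernetes` package.
    Returns the (namespace, name) pairs that were selected.
    """
    from kubernetes import client, config  # imported lazily on purpose

    # Acceptance criterion: read the kubeconf from a file.
    config.load_kube_config(config_file=kubeconfig_path)
    batch = client.BatchV1Api()

    selected = []
    for job in batch.list_job_for_all_namespaces().items:
        meta = job.metadata
        if is_older_than(meta.creation_timestamp, max_age):
            selected.append((meta.namespace, meta.name))
            if not dry_run:
                batch.delete_namespaced_job(
                    name=meta.name,
                    namespace=meta.namespace,
                    propagation_policy="Background",  # assumption: also remove the job's pods
                )
    return selected
```

Calling `cleanup_old_jobs("/path/to/kubeconf")` with the default `dry_run=True` only reports what would be deleted, which is a cautious way to check the selection before removing anything.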

Related issues 3 (0 open, 3 closed)

Blocked by Containers - action #123730: PCW: Create a container to clean up the leftovers of the kubernetes clusters (Resolved, ilausuch, 2023-01-27)

Blocks Containers - action #123502: PCW: Cleanup old jobs in azure kubernetes (Resolved, ilausuch, 2023-01-23)

Blocks Containers - action #124664: PCW: Move the kubernetes cleanup for jobs in Amazon to the cleanup_k8s script (Resolved, ilausuch, 2023-02-16)

Actions #1

Updated by ilausuch over 1 year ago

  • Subject changed from Cleanup jobs in google kubernetes to Cleanup old jobs in google kubernetes
Actions #2

Updated by ilausuch over 1 year ago

Currently, code similar to what we use on Amazon does not work here, because the kubernetes Python library uses the batch API to list and clean the jobs.
This generates the error:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:anonymous\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}

With the same credentials, using the kubectl client works:

kubectl get jobs
NAMESPACE          NAME                     COMPLETIONS   DURATION   AGE
default            jlausuch-vm-962          0/1           186d       186d
default            jlausuch-vm-964          0/1           186d       186d
default            jlausuch-vm-965          0/1           185d       185d
default            jlausuch-vm-966          0/1           185d       185d
default            openqa-9980567           1/1           25m        67d
default            openqa-9980614           1/1           16m        67d
default            openqa-9980622           1/1           6m39s      67d
helm-ns-10329748   helm-test-10329748-job   1/1           4s         5d22h

I am researching why this happens and whether it can be fixed, so that we can continue using the kubernetes library.
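As a debugging aid, the Status body returned by the API server can be classified to tell "no credentials were accepted" apart from "credentials accepted but not permitted". Kubernetes assigns the user `system:anonymous` to requests that carry no credentials, so the 403 above suggests the library sent the request unauthenticated. The helper below is a hypothetical sketch; its name and return labels are not part of PCW or of the kubernetes library.

```python
import json

ANONYMOUS_USER = "system:anonymous"


def diagnose_status(body):
    """Classify a Kubernetes Status JSON body from a failed API call."""
    status = json.loads(body)
    code = status.get("code")
    message = status.get("message", "")
    if code == 403 and ANONYMOUS_USER in message:
        return "anonymous"      # request reached the server without credentials
    if code == 403:
        return "forbidden"      # authenticated, but RBAC denies the action
    if code == 401:
        return "unauthorized"   # credentials were sent but rejected
    return "other"


# The 403 body quoted above, kept as a raw string so the escaped quotes survive.
SAMPLE_403 = (
    r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    r'"message":"jobs.batch is forbidden: User \"system:anonymous\" cannot list '
    r'resource \"jobs\" in API group \"batch\" at the cluster scope",'
    r'"reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}'
)

print(diagnose_status(SAMPLE_403))
```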
Actions #3

Updated by ilausuch over 1 year ago

This project shows how to enable the roleBinding for their tool. Maybe it is useful:

https://github.com/sukeesh/k8s-job-notify#to-start-using-this

Update:
I get the same error.
Keep in mind that when you use this kubeconf locally, you first have to authorize (gcloud login), which gives you a link to authorize in a browser. In our tests we are using OAuth2, which is why I don't think we can use the kubeconf directly.

Actions #4

Updated by ilausuch over 1 year ago

  • Description updated (diff)
Actions #5

Updated by ilausuch over 1 year ago

Now I am going to try to follow this code: https://stackoverflow.com/questions/54410410/authenticating-to-gke-master-in-python

But the call list_job_for_all_namespaces returns:

Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '88d4467a-a27d-4aa7-9c87-138468a3eae8', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Mon, 23 Jan 2023 14:40:35 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}

The same happens with the call list_pod_for_all_namespaces.
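For reference, the general shape of token-based authentication against a GKE endpoint looks roughly like the sketch below. It is an illustration only, not the code from the linked answer or from PCW: the function names, the `endpoint` and `ca_cert_path` parameters and the chosen scope are assumptions, and at this point in the ticket the calls still returned 401.

```python
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def bearer_api_key(token):
    """Build the api_key mapping that the kubernetes client sends as the Authorization header."""
    return {"authorization": "Bearer " + token}


def gke_batch_client(endpoint, ca_cert_path):
    """Return a BatchV1Api authenticated with a Google OAuth2 access token.

    Hypothetical helper; needs the third-party `google-auth` and `kubernetes`
    packages, plus application default credentials for the programmatic user.
    """
    import google.auth
    import google.auth.transport.requests
    from kubernetes import client

    credentials, _project = google.auth.default(scopes=[CLOUD_PLATFORM_SCOPE])
    credentials.refresh(google.auth.transport.requests.Request())  # fetch an access token

    configuration = client.Configuration()
    configuration.host = "https://" + endpoint   # cluster endpoint, assumed known
    configuration.ssl_ca_cert = ca_cert_path     # cluster CA certificate file
    configuration.api_key = bearer_api_key(credentials.token)
    return client.BatchV1Api(client.ApiClient(configuration))
```

A valid token only covers authentication; the identity behind it still needs permission to list and delete jobs.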

Actions #6

Updated by ilausuch over 1 year ago

With the new approach we are using our programmatic user.

  • I added the role Kubernetes Engine Developer (doesn't work yet)
  • As I understand it, I have to add the service account and RoleBinding to the RBAC

Following https://cloud.google.com/kubernetes-engine/docs/how-to/kubernetes-service-accounts and https://cloud.google.com/anthos/identity/setup/bearer-token-auth
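The RBAC objects mentioned above could look roughly like the manifests built below. This is a hedged sketch: the name `job-cleaner`, the namespace and the verb list are hypothetical, and a ClusterRole with a ClusterRoleBinding is used here (rather than a namespaced Role and RoleBinding) only because the earlier error refers to listing jobs at the cluster scope.

```python
import json


def job_cleaner_manifests(name="job-cleaner", namespace="default"):
    """Build ServiceAccount, ClusterRole and ClusterRoleBinding manifests as dicts."""
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": name, "namespace": namespace},
    }
    cluster_role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        # Only what the cleanup needs: list and delete jobs in the batch API group.
        "rules": [{"apiGroups": ["batch"], "resources": ["jobs"], "verbs": ["list", "delete"]}],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "ClusterRole", "name": name},
        "subjects": [{"kind": "ServiceAccount", "name": name, "namespace": namespace}],
    }
    return [service_account, cluster_role, binding]


for manifest in job_cleaner_manifests():
    print(json.dumps(manifest, indent=2))
```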

Actions #7

Updated by ilausuch over 1 year ago

Actions #8

Updated by ilausuch over 1 year ago

  • Blocked by action #123730: PCW: Create a container to clean up the leftovers of the kubernetes clusters added
Actions #9

Updated by ilausuch over 1 year ago

  • Status changed from In Progress to Blocked

Blocked until https://progress.opensuse.org/issues/123730 is solved. The two will be handled as parallel tasks.

Actions #10

Updated by ilausuch over 1 year ago

  • Blocks action #123502: PCW: Cleanup old jobs in azure kubernetes added
Actions #11

Updated by ilausuch over 1 year ago

  • Blocks action #124664: PCW: Move the kubernetes cleanup for jobs in Amazon to the cleanup_k8s script added
Actions #12

Updated by ilausuch over 1 year ago

  • Subject changed from Cleanup old jobs in google kubernetes to PCW: Cleanup old jobs in google kubernetes
Actions #14

Updated by ilausuch over 1 year ago

  • Status changed from Blocked to In Progress
Actions #15

Updated by ilausuch over 1 year ago

It keeps failing. Ansible is not applying the configuration correctly:

journalctl -u pcw_k8s.service
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: Starting Executes once the k8s cleanup...
Mar 01 00:00:11 publiccloud-ng.qa.suse.de podman[2260]: Error: unknown shorthand flag: 'r' in -rm
Mar 01 00:00:11 publiccloud-ng.qa.suse.de podman[2260]: See 'podman container run --help'
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: pcw_k8s.service: Main process exited, code=exited, status=125/n/a
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: pcw_k8s.service: Failed with result 'exit-code'.
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: Failed to start Executes once the k8s cleanup.
Actions #16

Updated by ilausuch over 1 year ago

This was an old failure that is already fixed by https://gitlab.suse.de/qac/publiccloud-qa-suse-de/-/merge_requests/73

Now the problem is that the EKS class inherits from EC2, which needs PCWConfig:

Mar 02 00:00:39 publiccloud-ng.qa.suse.de podman[2324]:     from .provider import Provider
Mar 02 00:00:39 publiccloud-ng.qa.suse.de podman[2324]:   File "/pcw/ocw/lib/provider.py", line 10, in <module>
Mar 02 00:00:39 publiccloud-ng.qa.suse.de podman[2324]:     from webui.settings import PCWConfig

https://github.com/SUSE/pcw/pull/213

Actions #17

Updated by ilausuch over 1 year ago

  • Status changed from In Progress to Resolved

Work done, it is working now.
