action #123499
closed
coordination #120660: [epic] Clean up old kubernetes jobs automatically
PCW: Cleanup old jobs in google kubernetes
Added by ilausuch over 1 year ago. Updated over 1 year ago.
Description
Updated by ilausuch over 1 year ago
- Subject changed from Cleanup jobs in google kubernetes to Cleanup old jobs in google kubernetes
Updated by ilausuch over 1 year ago
Currently, the similar code we used on Amazon does not work here, because the kubernetes Python library uses the batch API to list and clean up the jobs.
This generates the error:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:anonymous\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}
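The response body is a standard Kubernetes Status object; pulling it apart confirms the request was treated as unauthenticated (system:anonymous), i.e. the Python client never attached usable credentials. A small stdlib-only illustration using the body from above:

```python
import json

# The HTTP response body from the failing call, verbatim from the ticket.
body = r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch is forbidden: User \"system:anonymous\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope","reason":"Forbidden","details":{"group":"batch","kind":"jobs"},"code":403}'

status = json.loads(body)
print(status["reason"], status["code"])   # Forbidden 403
print(status["details"])                  # {'group': 'batch', 'kind': 'jobs'}
# "system:anonymous" in the message means the request carried no usable
# credentials, so RBAC denied listing jobs.batch at cluster scope.
```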
With the same credentials, using the client works
kubectl get jobs
NAMESPACE NAME COMPLETIONS DURATION AGE
default jlausuch-vm-962 0/1 186d 186d
default jlausuch-vm-964 0/1 186d 186d
default jlausuch-vm-965 0/1 185d 185d
default jlausuch-vm-966 0/1 185d 185d
default openqa-9980567 1/1 25m 67d
default openqa-9980614 1/1 16m 67d
default openqa-9980622 1/1 6m39s 67d
helm-ns-10329748 helm-test-10329748-job 1/1 4s 5d22h
I am researching why this happens and whether it can be fixed so we can continue using the kubernetes library.
Updated by ilausuch over 1 year ago
This project shows how to enable the RoleBinding for their tool. Maybe it is useful:
https://github.com/sukeesh/k8s-job-notify#to-start-using-this
Update:
I get the same error.
We have to consider that when you use this kubeconfig locally, you first have to authorize (gcloud login), which gives you a link to open in a browser. In our tests we are using OAuth2; this is the reason I don't think we can use the kubeconfig directly.
Updated by ilausuch over 1 year ago
Now I am going to try to follow this code: https://stackoverflow.com/questions/54410410/authenticating-to-gke-master-in-python
But for the call list_job_for_all_namespaces I am getting:
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '88d4467a-a27d-4aa7-9c87-138468a3eae8', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Mon, 23 Jan 2023 14:40:35 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
The same happens with the call list_pod_for_all_namespaces.
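The 401 differs from the earlier 403: here the API server rejected the credentials outright, rather than denying an authenticated identity. With the token-based approach, every request must carry an Authorization header; a stdlib-only sketch of what that looks like at the HTTP level (the endpoint and token are placeholders, not real values from this setup):

```python
import urllib.request

def k8s_request(endpoint: str, path: str, token: str) -> urllib.request.Request:
    """Build an authenticated request against the Kubernetes API.
    In the real setup the token would come from gcloud / google-auth
    for the programmatic user; here it is a placeholder."""
    req = urllib.request.Request(endpoint + path)
    # Without this header the API server sees system:anonymous (the 403
    # above); with a missing or expired token it answers 401 Unauthorized.
    req.add_header("Authorization", "Bearer " + token)
    return req

req = k8s_request("https://<gke-endpoint>", "/apis/batch/v1/jobs", "<token>")
print(req.get_header("Authorization"))  # Bearer <token>
```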
Updated by ilausuch over 1 year ago
With the new approach we are using our programmatic user.
- I added the role Kubernetes Engine Developer (doesn't work yet)
- I understand I have to add the service account and RoleBinding to the RBAC
Following https://cloud.google.com/kubernetes-engine/docs/how-to/kubernetes-service-accounts and https://cloud.google.com/anthos/identity/setup/bearer-token-auth
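Following those guides typically means granting the service account explicit RBAC permissions on jobs.batch. A hypothetical manifest of that shape (all names here are assumptions, not the ones actually used in this setup):

```yaml
# Hypothetical RBAC: allow listing/deleting Jobs cluster-wide and bind
# that role to the service account the cleanup authenticates as.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pcw-job-cleanup
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["list", "get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pcw-job-cleanup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pcw-job-cleanup
subjects:
- kind: ServiceAccount
  name: pcw-cleanup
  namespace: default
```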
Updated by ilausuch over 1 year ago
A PR that works:
https://github.com/SUSE/pcw/pull/189
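The linked PR contains the actual implementation; as a rough, stdlib-only sketch of the core idea behind such a cleanup (the threshold, function name, and job shape are assumptions, not the PR's code):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=1)  # assumed threshold; the real value lives in the PCW config

def is_expired(start_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if a job started longer ago than MAX_AGE.
    start_time mirrors job.status.start_time from the kubernetes client;
    expired jobs would then be deleted via the batch API."""
    now = now or datetime.now(timezone.utc)
    return now - start_time > MAX_AGE

now = datetime(2023, 1, 23, tzinfo=timezone.utc)
print(is_expired(datetime(2022, 7, 21, tzinfo=timezone.utc), now))  # True (~186 days old)
print(is_expired(datetime(2023, 1, 23, tzinfo=timezone.utc), now))  # False
```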
Updated by ilausuch over 1 year ago
- Blocked by action #123730: PCW: Create a container to clean up the leftovers of the kubernetes clusters added
Updated by ilausuch over 1 year ago
- Status changed from In Progress to Blocked
Blocked until https://progress.opensuse.org/issues/123730 is solved. They will be parallel tasks.
Updated by ilausuch over 1 year ago
- Blocks action #123502: PCW: Cleanup old jobs in azure kubernetes added
Updated by ilausuch over 1 year ago
- Blocks action #124664: PCW: Move the kubernetes cleanup for jobs in Amazon to the cleanup_k8s script added
Updated by ilausuch over 1 year ago
- Subject changed from Cleanup old jobs in google kubernetes to PCW: Cleanup old jobs in google kubernetes
Updated by ilausuch over 1 year ago
Updated by ilausuch over 1 year ago
It keeps failing. Ansible is not applying the configuration correctly.
journalctl -u pcw_k8s.service
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: Starting Executes once the k8s cleanup...
Mar 01 00:00:11 publiccloud-ng.qa.suse.de podman[2260]: Error: unknown shorthand flag: 'r' in -rm
Mar 01 00:00:11 publiccloud-ng.qa.suse.de podman[2260]: See 'podman container run --help'
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: pcw_k8s.service: Main process exited, code=exited, status=125/n/a
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: pcw_k8s.service: Failed with result 'exit-code'.
Mar 01 00:00:11 publiccloud-ng.qa.suse.de systemd[1]: Failed to start Executes once the k8s cleanup.
Updated by ilausuch over 1 year ago
This was an old failure that is already fixed by https://gitlab.suse.de/qac/publiccloud-qa-suse-de/-/merge_requests/73
Now the problem is that the EKS class inherits from the EC2 one, which needs the PCWConfig:
Mar 02 00:00:39 publiccloud-ng.qa.suse.de podman[2324]: from .provider import Provider
Mar 02 00:00:39 publiccloud-ng.qa.suse.de podman[2324]: File "/pcw/ocw/lib/provider.py", line 10, in <module>
Mar 02 00:00:39 publiccloud-ng.qa.suse.de podman[2324]: from webui.settings import PCWConfig
Updated by ilausuch over 1 year ago
- Status changed from In Progress to Resolved
Work done; it's working now.