action #78127
closed
follow-up to #73633 - lessons learned and suggestions
Suggestions from #73633#note-30
- A passive performance measurement regarding throughput on interfaces
- Whenever we apply changes to the infrastructure, we should have a ticket
- Whenever creating any external ticket, e.g. EngInfra, also create an internal tracker ticket, because there might be more internal notes
- As in the OSD deployment, we should look for failed grafana
- Collect all the information between "last good" and "first bad" and then also find the git diff in openqa/salt-states-openqa
- Apply proper "scientific method" with written down hypotheses, experiments and conclusions in tickets, follow
- Keep salt states to describe what should not be there
- Try out older btrfs snapshots in systems for crosschecking and boot with disabled salt. In the kernel cmdline append
- The team should conduct a work backlog check on a daily basis
- nsinger does not mind if someone else provides a suggestion or takes over the ticket
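To illustrate the first suggestion, here is a minimal sketch of a passive throughput measurement on network interfaces. It assumes a Linux host and reads the kernel's byte counters from /proc/net/dev instead of generating test traffic. The function names `parse_net_dev`, `throughput` and `sample` and the 5 second interval are hypothetical choices for this example, not something taken from the OSD setup or from salt-states-openqa.

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(before, after, seconds):
    """Return {interface: (rx_bytes_per_s, tx_bytes_per_s)} between two samples."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = ((rx1 - rx0) / seconds, (tx1 - tx0) / seconds)
    return rates


def sample(path="/proc/net/dev"):
    """Read the current counters (assumption: Linux procfs is available)."""
    with open(path) as f:
        return parse_net_dev(f.read())


if __name__ == "__main__":
    interval = 5.0  # hypothetical sampling interval in seconds
    first = sample()
    time.sleep(interval)
    second = sample()
    for name, (rx, tx) in sorted(throughput(first, second, interval).items()):
        print(f"{name}: rx {rx:.0f} B/s, tx {tx:.0f} B/s")
```

Because this only reads counters the kernel maintains anyway, it adds no load to the interfaces. In practice the same numbers would more likely come from the existing monitoring stack than from a standalone script.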
Updated by okurz over 4 years ago
- Copied from action #73633: OSD partially unresponsive, triggering 500 responses, spotty response visible in monitoring panels but no alert triggered (yet) added
Updated by livdywan about 4 years ago
I feel like this is not Workable: it has all of the suggestions we came up with, but it's not clear yet what result we expect, i.e. the ACs. And I'm not sure why it has an assignee... @okurz did you mean to define the ACs? Or maybe we should have another call to do that?
Updated by okurz about 4 years ago
hm, actually I think all except the first point could be moved to the wiki as "best practices" or "good to know" as is.
Updated by okurz about 4 years ago
- Status changed from Workable to Resolved
- created
- added comments to
- first point added to #65271