Abnormal load on one of our Outscale K8s cluster nodes

Incident Report for Elium

Resolved

We performed several tests (including the deployment of a new version of the Elium services) to validate that the new node is stable.
Posted Jul 08, 2021 - 18:16 CEST

Monitoring

We have created a new node using different hardware specifications (CPU type). After several tests, we found that the abnormal load problem no longer occurs on this type of machine. We continue to monitor the behaviour of this node. At the same time, we are reporting our findings to 3DS Outscale support in order to validate that the problem comes from the type of machine used for this node.
Posted Jul 08, 2021 - 14:11 CEST

Update

We are still testing different configurations for the faulty node (a different kernel version, creating another node).
Posted Jul 08, 2021 - 11:34 CEST

Update

We completely recreated the node and redeployed the services. The load continues to increase abnormally and this impacts the customer instances. We have therefore, once again, disabled the services on this node.
Posted Jul 08, 2021 - 10:46 CEST

Update

We are trying to solve the node load problem. When the services restart on the node, the instances of clients on our private hosting (Outscale) slow down.
Posted Jul 08, 2021 - 10:27 CEST

Identified

During rolling updates, restarting containers on the node produces timeouts.
Posted Jul 07, 2021 - 22:41 CEST

Update

Restarting the node solved the load problem. We are still checking why this load occurred. Currently, the services are working properly again.
Posted Jul 07, 2021 - 19:22 CEST

Investigating

We have detected an abnormal load on one of the nodes of our Outscale Kubernetes cluster. We had to restart it.
Posted Jul 07, 2021 - 19:20 CEST
This incident affected: Private Hosting.