Involuntary Security Upgrade
Resolved
Feb 17 at 05:47pm CET
Postmortem: Involuntary Security Upgrade
Incident Timeline
- 14:43 CET: Initiation of a security upgrade.
- 14:58 CET: The first service began experiencing significant degradation, which escalated to disruptions.
- 14:58 - 16:34 CET: Continued degradations and disruptions as the upgrade progressed.
- 16:34 CET: Completion of the upgrade, restoring the cluster to its operational state.
- Post-Upgrade Inspection: Document ingestion was found to be non-functional.
- 17:47 CET: Document ingestion pipelines were repaired, fully resolving the incident.
Root Cause Analysis
- The upgrade was intended to update the system to a newer Kubernetes version to enhance maintainability and security.
- The upgrade should have been scheduled outside of business hours to minimize impact during periods of high load.
- An uncoordinated change led to the upgrade being triggered prematurely during peak operational hours.
- This premature upgrade put significant strain on the Kubernetes cluster while it was under peak load, leading to the observed service disruptions.
Impact Assessment
- Duration of Severe Degradations: 1 hour and 33 minutes (partial system availability was maintained).
- Affected Services: unique.app multitenant features (Chat and Recording). Other tenants remained unaffected.
Preventative Measures
- Implement additional safeguards in the infrastructure upgrade process to prevent unscheduled changes.
- Allocate more resources during daytime operations to allow for cluster rebalancing, reducing the impact of involuntary disruptions.
- Further segment Unique workloads into smaller, independently upgradable pools to minimize the blast radius of similar incidents in the future.
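As an illustration of the first measure, a minimal sketch of a pre-upgrade guard that refuses to start during business hours might look like the following. The business-hours window (08:00 to 18:00 on weekdays), the fixed CET offset, and the function names `is_upgrade_allowed` and `guard_upgrade` are assumptions made for this example; they do not describe the actual upgrade tooling.

```python
from datetime import datetime, time, timedelta, timezone

# Assumed values for illustration only; the postmortem does not define them.
CET = timezone(timedelta(hours=1), "CET")  # fixed offset, ignores daylight saving
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def is_upgrade_allowed(now: datetime) -> bool:
    """Return True only if `now` falls outside the assumed business hours."""
    if now.tzinfo is None:
        raise ValueError("now must be a timezone-aware datetime")
    local = now.astimezone(CET)
    if local.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return True
    return not (BUSINESS_START <= local.time() < BUSINESS_END)


def guard_upgrade(now: datetime) -> None:
    """Raise before an upgrade starts if it would run during business hours."""
    if not is_upgrade_allowed(now):
        raise RuntimeError(
            f"Upgrade blocked: {now.astimezone(CET):%H:%M} CET is within business hours"
        )
```

Under these assumptions, a weekday start at 14:43 CET, as in the timeline above, would be rejected, while a start at 22:00 CET would pass. A check like this would sit in front of whatever mechanism triggers the upgrade, so that an uncoordinated change cannot start it on its own.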
Updated
Feb 17 at 04:34pm CET
An unplanned security upgrade to a newer Kubernetes version during peak hours caused significant service degradations and disruptions for 1 hour and 33 minutes, affecting Unique.app's multitenant features. The incident was resolved by 17:47 CET after repairing document ingestion pipelines. To prevent future occurrences, we will implement stricter safeguards, allocate more resources for daytime operations, and segment workloads for independent upgrades.
Created
Feb 17 at 02:43pm CET