Resolved
Feb 17 at 05:47pm CET

Postmortem: Unplanned Security Upgrade

Incident Timeline

14:43 CET: A security upgrade was initiated.
14:58 CET: The first service began experiencing significant degradation, which escalated to disruptions.
14:58 - 16:34 CET: Degradations and disruptions continued as the upgrade progressed.
16:34 CET: Completion of the upgrade, restoring the cluster to its operational state.
Post-Upgrade Inspection: Document ingestion was found to be non-functional.
17:47 CET: Document ingestion pipelines were repaired, fully resolving the incident.

Root Cause Analysis

The upgrade was intended to move the cluster to a newer Kubernetes version to improve maintainability and security.
It should have run outside business hours, when load is low, to minimize user impact.
An uncoordinated change instead triggered the upgrade prematurely, during peak operational hours.
Running the upgrade under peak load put significant strain on the Kubernetes cluster, which led to the observed service disruptions.
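
As a rough illustration of why timing matters: a rolling upgrade takes nodes out of service, and the workloads they were running must fit on the nodes that remain. The sketch below is a simplified model under stated assumptions, not a reconstruction of this incident. The function name, the pool size, and all CPU figures are hypothetical.

```python
def spare_cpu_during_drain(nodes: int, cpu_per_node: float,
                           cpu_requested: float, draining: int) -> float:
    """CPU left over while `draining` nodes are out of service.

    A negative result means the displaced workloads cannot all be
    rescheduled onto the remaining nodes.
    """
    if not 0 <= draining <= nodes:
        raise ValueError("draining must be between 0 and the node count")
    remaining_capacity = (nodes - draining) * cpu_per_node
    return remaining_capacity - cpu_requested


# Hypothetical pool: 10 nodes with 16 CPUs each, upgraded one node at a time.
off_peak = spare_cpu_during_drain(10, 16.0, cpu_requested=90.0, draining=1)
peak = spare_cpu_during_drain(10, 16.0, cpu_requested=150.0, draining=1)

print(off_peak)  # 54.0: enough headroom to reschedule everything
print(peak)      # -6.0: some workloads have nowhere to go
```

In this model the same upgrade is harmless off-peak and disruptive at peak, simply because the headroom left after draining a node differs.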

Impact Assessment

Duration of Severe Degradations: 1 hour and 36 minutes, from 14:58 to 16:34 CET (partial system availability was maintained).
Affected Services: unique.app multitenant features (Chat and Recording). Other tenants remained unaffected.

Preventative Measures

Implement additional safeguards in the infrastructure upgrade process to prevent unscheduled changes.
Allocate more resources during daytime operations to allow for cluster rebalancing, reducing the impact of involuntary disruptions.
Further segment unique workloads into smaller, independently upgradable pools to minimize the blast radius of similar incidents in the future.
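
One possible shape for the first safeguard is a time-of-day check that upgrade automation runs before starting any cluster change. This is a minimal sketch, not the actual tooling: the function name `upgrade_allowed` and the 22:00 to 06:00 window are assumptions chosen for illustration, and times are taken to be local cluster time.

```python
from datetime import time

# Hypothetical maintenance window in local time; it spans midnight.
WINDOW_START = time(22, 0)
WINDOW_END = time(6, 0)


def upgrade_allowed(now: time,
                    start: time = WINDOW_START,
                    end: time = WINDOW_END) -> bool:
    """Return True if `now` falls inside the window [start, end)."""
    if start <= end:
        # Window contained within a single day, e.g. 01:00 to 05:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


# The upgrade in this incident started at 14:43, outside the assumed window.
print(upgrade_allowed(time(14, 43)))  # False
print(upgrade_allowed(time(23, 30)))  # True
```

A check like this only blocks changes started at the wrong time of day. It would need to be paired with a review or approval step to catch uncoordinated changes in general.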

Updated
Feb 17 at 04:34pm CET

An unplanned security upgrade to a newer Kubernetes version during peak hours caused significant service degradations and disruptions for 1 hour and 36 minutes, affecting Unique.app's multitenant features. The incident was resolved by 17:47 CET after the document ingestion pipelines were repaired. To prevent recurrence, we will implement stricter safeguards, allocate more resources for daytime operations, and segment workloads into independently upgradable pools.

Created
Feb 17 at 02:43pm CET
