Degraded

Involuntary Security Upgrade

Feb 17 at 02:43pm CET
Affected services
Microsoft
Google Calendar
Google Meets
Chat
Knowledge Base
Recording
Sidebar
IDP
Chat
Knowledge/Ingestion
Scope Management
Recording
Sidebar
Webhooks

Resolved
Feb 17 at 05:47pm CET

Postmortem: Involuntary Security Upgrade

Incident Timeline

  1. 14:43 CET: Initiation of a security upgrade.
  2. 14:58 CET: The first service began experiencing significant degradation, which escalated to disruption.
  3. 14:58 - 16:34 CET: Continued degradations and disruptions as the upgrade progressed.
  4. 16:34 CET: Completion of the upgrade, restoring the cluster to its operational state.
  5. Post-Upgrade Inspection: Document ingestion was found to be non-functional.
  6. 17:47 CET: Document ingestion pipelines were repaired, fully resolving the incident.

Root Cause Analysis

  1. The upgrade was intended to update the system to a newer Kubernetes version to enhance maintainability and security.
  2. The upgrade should have been scheduled outside of business hours to minimize impact during periods of high load.
  3. An uncoordinated change led to the upgrade being triggered prematurely during peak operational hours.
  4. This premature upgrade placed significant strain on the Kubernetes cluster, leading to the observed service disruptions.

Impact Assessment

  • Duration of Severe Degradations: 1 hour and 33 minutes (partial system availability was maintained).
  • Affected Services: unique.app multitenant features (Chat and Recording). Other tenants remained unaffected.

Preventative Measures

  1. Implement additional safeguards in the infrastructure upgrade process to prevent unscheduled changes.
  2. Allocate more resources during daytime operations to allow for cluster rebalancing, reducing the impact of involuntary disruptions.
  3. Further segment Unique workloads into smaller, independently upgradable pools to minimize the blast radius of similar incidents in the future.
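
As an illustration of the first measure, a safeguard against unscheduled changes could take the form of a pre-flight check that refuses to start an upgrade during business hours unless it has been explicitly approved. The sketch below is a minimal example under stated assumptions and does not describe the actual tooling: the business-hours window (weekdays, 07:00 to 19:00), the fixed CET offset, and the function names are all hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: a fixed UTC+1 offset stands in for CET and ignores daylight saving time.
CET = timezone(timedelta(hours=1), "CET")

# Assumption: hypothetical business hours; the real policy is not stated in this report.
BUSINESS_START = time(7, 0)
BUSINESS_END = time(19, 0)


def in_business_hours(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls on a weekday within business hours."""
    local = moment.astimezone(CET)
    is_weekday = local.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= local.time() < BUSINESS_END


def may_start_upgrade(moment: datetime, approved: bool = False) -> bool:
    """Allow an upgrade outside business hours, or at any time with explicit approval."""
    return approved or not in_business_hours(moment)


if __name__ == "__main__":
    # Illustrative date: a Monday at 14:43 CET, the time of day this upgrade began.
    peak = datetime(2025, 2, 17, 14, 43, tzinfo=CET)
    night = datetime(2025, 2, 17, 23, 0, tzinfo=CET)
    print(may_start_upgrade(peak))   # False
    print(may_start_upgrade(night))  # True
```

A check of this kind would typically run as a gate in the upgrade pipeline, so that a change merged at the wrong time is held back instead of triggering the rollout immediately.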

Updated
Feb 17 at 04:34pm CET

An unplanned security upgrade to a newer Kubernetes version during peak hours caused significant service degradations and disruptions for 1 hour and 33 minutes, affecting Unique.app's multitenant features. The incident was resolved by 17:47 CET after repairing document ingestion pipelines. To prevent future occurrences, we will implement stricter safeguards, allocate more resources for daytime operations, and segment workloads for independent upgrades.

Created
Feb 17 at 02:43pm CET
