Partially Degraded Performance [US region]

Incident Report for Social Plus

Postmortem

Incident Date: 2026-01-21
Impact: System degradation and intermittent downtime.
Primary Cause: Infrastructure resource exhaustion triggered by an unprecedented high-volume traffic surge.

1. Summary

On January 21, platform traffic surged to an unprecedented peak of 450,000 requests per minute (~56.25x baseline). While application servers autoscaled successfully, the Core Database became the bottleneck. Despite two manual vertical scaling interventions, the system experienced two periods of degradation before stabilizing as database capacity finally matched the demand.

2. Root Cause

The root cause of the incident was infrastructure resource exhaustion resulting from insufficient database overhead to accommodate a sudden traffic spike.

  • Traffic Volume: An unprecedented surge in external demand drove platform traffic significantly beyond predicted growth, increasing from a baseline of 8,000 req/min to a peak of 450,000 req/min (a ~56.25x increase).
  • Scaling Operation Time: Vertical scaling of the Core Database required a 10–30 minute operation time per event. During these intervals, the system remained degraded as incoming demand outpaced both available capacity and recovery speed.
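
The surge multiple quoted above follows directly from the reported figures; a quick check (a sketch, using only the numbers stated in this report):

```python
# Recompute the surge multiple from the figures quoted in the report.
baseline = 8_000   # req/min, pre-incident baseline
peak = 450_000     # req/min, peak during the surge
print(f"surge: {peak / baseline:.2f}x baseline")  # 56.25x
```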

3. Optimizations & Corrective Actions

Based on the investigation, we will implement the following technical safeguards:

A. Transition Impacted Queries to Secondary Nodes

  • Action: Reconfigure remaining database queries to target Secondary (Read) Replicas rather than the Primary node.
  • Goal: Offload significant pressure from the Primary database. With read traffic served by replicas, the Primary retains enough resource overhead to avoid being choked by contention, allowing it to complete vertical scaling operations and recover much faster during a surge.
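
The routing change above can be sketched as a simple read/write splitter. This is a minimal illustration, not our production data layer; the endpoint names and the `QueryRouter` class are hypothetical, and real deployments must also account for replication lag:

```python
# Sketch: route plain SELECTs to read replicas (round-robin) and all
# writes -- plus locking reads -- to the Primary. Endpoint names are
# illustrative placeholders.
import itertools

class QueryRouter:
    """Route writes to the primary and reads round-robin across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def endpoint_for(self, sql):
        # Only plain SELECTs are safe on a replica; INSERT/UPDATE/DDL and
        # SELECT ... FOR UPDATE must still target the Primary.
        stmt = sql.lstrip().upper()
        if stmt.startswith("SELECT") and "FOR UPDATE" not in stmt:
            return next(self._replicas)
        return self.primary

router = QueryRouter("primary.db.internal",
                     ["replica-1.db.internal", "replica-2.db.internal"])
print(router.endpoint_for("SELECT * FROM feed WHERE user_id = 7"))
print(router.endpoint_for("UPDATE posts SET likes = likes + 1"))
```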

B. Optimize Autoscaling Performance (Server & Database)

  • Action: Review and tune autoscaling policies for both the App Tier and Database Tier to specifically reduce operation time.
  • Goal: Decrease the "Time-to-Ready" for new resources. By optimizing scaling triggers and resource warm-up procedures, we ensure capacity is provisioned more rapidly, improving the system's overall recovery time during a sudden spike.
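
The "Time-to-Ready" reduction described above can be illustrated with a back-of-the-envelope model. All durations below are illustrative assumptions, not measured values from this incident:

```python
# Sketch: estimate seconds from a scaling trigger firing until new
# capacity serves traffic, comparing a cold boot against a pre-warmed
# pool. Durations are hypothetical for illustration only.

def time_to_ready(detect_s, provision_s, warmup_s, prewarmed=False):
    """Seconds from threshold breach until the new node takes traffic."""
    if prewarmed:
        # A pre-warmed instance skips provisioning and warm-up entirely.
        return detect_s
    return detect_s + provision_s + warmup_s

cold = time_to_ready(detect_s=60, provision_s=180, warmup_s=120)
warm = time_to_ready(detect_s=60, provision_s=180, warmup_s=120,
                     prewarmed=True)
print(f"cold boot: {cold}s, pre-warmed: {warm}s")  # 360s vs 60s
```

Lowering the detection threshold and pre-warming resources both shrink different terms of this sum, which is why the policy review targets triggers and warm-up procedures together.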
Posted Jan 22, 2026 - 21:34 GMT+07:00

Resolved

This incident has been resolved.
Posted Jan 21, 2026 - 22:58 GMT+07:00

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 21, 2026 - 22:37 GMT+07:00

Investigating

We're investigating an alert on our social and realtime performance. You may experience delays in social and realtime connections or response times.
Posted Jan 21, 2026 - 21:00 GMT+07:00
This incident affected: Social+ Cloud (Core Services (US)).