TronMC - AWS Outage – Incident details

AWS Outage

Resolved
Degraded performance
Started 6 days ago
Lasted about 7 hours

Affected

Domains

Degraded performance from 4:09 PM to 10:53 PM

cdn.tronmc.com

Degraded performance from 4:09 PM to 10:53 PM

Databases

Degraded performance from 4:09 PM to 10:53 PM

Cluster 1: AWS / N. Virginia (us-east-1) Node

Degraded performance from 4:09 PM to 10:53 PM

Updates
  • Resolved


    From: AMZN

    Oct 20 3:53 PM PDT Between 11:49 PM PDT on October 19 and 2:24 AM PDT on October 20, we experienced increased error rates and latencies for AWS Services in the US-EAST-1 Region. Additionally, services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time. At 12:26 AM on October 20, we identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints. After resolving the DynamoDB DNS issue at 2:24 AM, services began recovering but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB. As we continued to work through EC2 instance launch impairments, Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch. We recovered the Network Load Balancer health checks at 9:38 AM. As part of the recovery effort, we temporarily throttled some operations such as EC2 instance launches, processing of SQS queues via Lambda Event Source Mappings, and asynchronous Lambda invocations. Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered. By 3:01 PM, all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours. We will share a detailed AWS post-event summary.

    The issue has now been resolved.

  • Monitoring


    From: AMZN

    Oct 20 2:48 PM PDT We have restored EC2 instance launch throttles to pre-event levels and EC2 launch failures have recovered across all Availability Zones in the US-EAST-1 Region. AWS services which rely on EC2 instance launches such as Redshift are working through their backlog of EC2 instance launches successfully and we anticipate full recovery of the backlog over the next two hours. We can confirm that Connect is handling new voice and chat sessions normally. There is a backlog of analytics and reporting data that we must process, and we anticipate that we will have worked through the backlog over the next two hours. We will provide an update by 3:30 PM PDT.

  • Identified

    AWS (one of our providers) is currently experiencing a global outage. This may interfere with some of TronMC’s systems. We will update you as they update us.