Resolved: Network Connectivity Issues
AWS experienced network latencies and errors which started around 10:20am PT that were caused by resource contention within the subsystem responsible for propagation of network mappings within the Amazon Virtual Private Cloud.
This affected the elasticity of our services, since we routinely scale API resources in and out in response to demand. Our early attempts to restore operation by creating and moving services introduced latency, which surfaced as failed health checks and connectivity timeouts with other supporting services.
By around 9:15pm PT, we were no longer experiencing any resource contention and our systems operated normally.
We continue to monitor the health of our systems. The error rates have fallen significantly and performance has stabilized. AWS is beginning to operate normally.
We will provide a final update when all has been confirmed.
AWS continues to work on their network mapping propagation. Our current plan is to monitor and investigate any additional mitigations in conjunction with the AWS mitigation phases.
We expect a partial outage across all Modern Treasury endpoints, with elevated rates of 504s.
We will continue to provide updates every 90 minutes, or as we have additional information to share.
The Modern Treasury API offers idempotent requests to prevent accidental duplication of API calls. This feature is particularly useful when initiating actions such as money transfers, entity creation, or resource modifications. For instance, if a gateway timeout (504) occurs while creating a Payment Order, you can safely retry the request using the same idempotency key to ensure that only one payment order is created.
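The retry pattern described above can be sketched roughly as follows. This is a minimal illustration, not our client library: the `send` callable, the `create_with_retry` helper, the set of retryable status codes, and the backoff values are all hypothetical, and the `Idempotency-Key` header name is an assumption to confirm against the API reference. The key point is that the key is generated once and reused on every attempt.

```python
import time
import uuid
from typing import Callable, Dict, Tuple

# Hypothetical transport signature: (payload, headers) -> (status_code, body).
Sender = Callable[[Dict, Dict[str, str]], Tuple[int, Dict]]

# Assumption: treat gateway-style errors as safe to retry.
RETRYABLE_STATUSES = {502, 503, 504}


def create_with_retry(
    send: Sender,
    payload: Dict,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Dict]:
    """Send `payload`, retrying retryable failures with one fixed idempotency key."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    # Generate the key once, outside the loop, so every retry carries the same key.
    # Header name is an assumption; check the API reference.
    headers = {"Idempotency-Key": str(uuid.uuid4())}

    status, body = 0, {}
    for attempt in range(max_attempts):
        status, body = send(payload, headers)
        if status not in RETRYABLE_STATUSES:
            break
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return status, body
```

In a real integration, `send` would wrap an authenticated HTTP POST to the Payment Order creation endpoint. Generating a new key per attempt would defeat the purpose, since the API could then treat each retry as a distinct request.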
We continue to monitor the health of our services. The error rates have continued to fall and have stabilized at a very low rate.
We expect to be back at healthy operation levels in a few hours, after AWS has completed their updates to address the network connectivity issues and errors affecting the Availability Zones in the us-west-2 region.
You can see AWS's status updates here: https://health.aws.amazon.com/health/status
We are continuing to work on mitigations. Our monitoring shows our error rate steadily decreasing. From the beginning of the incident until now, our error rate has averaged around 1%, and our worst one-minute window peaked at a 7.2% error rate.
AWS is actively working on fixing the us-west-2 Availability Zone issues. You can see their status updates here: https://health.aws.amazon.com/health/status
AWS is experiencing network latencies and errors in multiple Availability Zones in the us-west-2 region.
We currently have limited impact due to our high availability network layout within the us-west-2 region.
We are currently looking into additional mitigations to further limit impact.