
Incident Post Mortem: May 19, 2021

May 27, 2021 · 4 min read

The Outage

Leading up to this incident, a sudden price drop in the crypto market (ETH dropped 20%, BTC dropped 25%) caused a large spike in traffic as many users reacted. A group of on-call engineers convened after being paged for high error rates across several services.

The affected services were:

  • Logged-out web servers: Users who weren’t logged in hit an error page when visiting coinbase.com.

  • GraphQL service: Parts of the mobile app loaded very slowly and returned errors ~10% of the time.

  • Coinbase Pro API: Coinbase Pro was partially unreachable.

  • Non-US card payment processing service: Non-US customers attempting to buy crypto with a card had their payments rejected.

Once these issues were identified, engineers split into groups to investigate each issue in parallel and prioritize follow-up actions.

Root Cause Analysis

In the days since the outage, we have reconstructed a clear picture of what happened from the first minute onward.

1. The logged-out coinbase.com pages were largely unreachable as the instances started failing and took over 40 minutes to return to a healthy state. The rapid spike in requests hit a maximum connection threshold on our Nginx routers. We manually raised that limit during the incident, which ultimately addressed the bottleneck.

[Figure: NodeJS HTML Response]

2. We saw timeouts and increased latency on our GraphQL service, which aggregates data from underlying services. The timeouts occurred because the GraphQL service scaled up too slowly. Autoscaling eventually caught up and the errors subsided, restoring functionality to the mobile app and logged-in users.

[Figure: GraphQL Errors]

3. We saw that the database that powers the Coinbase Pro exchange had high latency and CPU load. Additionally, the API servers that run our market data feed were under high CPU load. We increased the operation throughput configured on the database and provisioned more API servers.

[Figure: Coinbase Pro API Response Time]

4. In our Non-US card payment processing service, the number of failed payments increased as the queue to process the payments became backlogged. We increased the number of queue workers and card payments started succeeding.

[Figure: Queue Size]
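The backlog described in item 4 comes down to simple arithmetic: a queue grows whenever jobs arrive faster than the workers can drain it. The following is a minimal sketch of that relationship, not a model of the actual payment system. The function names, arrival rates, and per-worker throughput are all hypothetical and chosen only for illustration.

```python
from math import ceil


def simulate_backlog(arrivals_per_min, workers, per_worker_per_min):
    """Return the queue depth at the end of each minute.

    arrivals_per_min: jobs arriving in each minute (hypothetical numbers).
    workers, per_worker_per_min: illustrative capacity parameters.
    """
    capacity = workers * per_worker_per_min
    depth = 0
    history = []
    for arrived in arrivals_per_min:
        # Jobs that cannot be processed this minute stay in the queue.
        depth = max(0, depth + arrived - capacity)
        history.append(depth)
    return history


def workers_needed(peak_arrivals_per_min, per_worker_per_min):
    """Smallest worker count whose capacity covers the peak arrival rate."""
    return ceil(peak_arrivals_per_min / per_worker_per_min)


# Hypothetical traffic: steady 100 jobs/min, then a spike to 400 jobs/min.
traffic = [100] * 3 + [400] * 3

print(simulate_backlog(traffic, workers=5, per_worker_per_min=30))
# [0, 0, 0, 250, 500, 750]
print(simulate_backlog(traffic, workers=14, per_worker_per_min=30))
# [0, 0, 0, 0, 0, 0]
print(workers_needed(400, 30))
# 14
```

With too few workers the backlog grows by a fixed amount every minute of the spike, which is why adding workers (rather than waiting) is what clears it.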

Improvements

At Coinbase, we’ve committed significant resources to improving our reliability, including regular load tests to prepare us for periods of high traffic. However, this incident has identified some blind spots to address, especially around very sudden spikes of traffic.

A common theme across several of the failures in this incident was autoscaling rules that weren’t tuned to the kind of traffic spikes that crypto markets can cause. We’re working on tailoring our load tests to better simulate real-world situations, such as sudden traffic spikes. This will help surface more issues, like untuned autoscaling rules, during controlled testing.
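To illustrate why the shape of a load test matters, here is a rough simulation of a reactive autoscaler facing two traffic profiles that reach the same peak. This is a sketch under stated assumptions, not Coinbase's autoscaling configuration: the scaling rule (add a fixed number of instances when load exceeds a target utilization, at most once per cooldown period) and every number below are hypothetical.

```python
def gradual_ramp(start, peak, minutes):
    """Traffic that climbs linearly from start to peak (requests per minute)."""
    step = (peak - start) / (minutes - 1)
    return [round(start + step * i) for i in range(minutes)]


def sudden_spike(start, peak, minutes, spike_at):
    """Traffic that jumps from start to peak at minute spike_at."""
    return [start if i < spike_at else peak for i in range(minutes)]


def dropped_requests(traffic, instances, per_instance, add_per_step,
                     cooldown, target_utilization=0.8):
    """Simulate a reactive autoscaler; return total requests over capacity.

    The scaler adds add_per_step instances when load exceeds
    target_utilization of current capacity, at most once per cooldown minutes.
    """
    dropped = 0
    last_scale = -cooldown  # allow a scaling action in the first minute
    for minute, load in enumerate(traffic):
        capacity = instances * per_instance
        dropped += max(0, load - capacity)
        over_target = load > target_utilization * capacity
        if over_target and minute - last_scale >= cooldown:
            instances += add_per_step
            last_scale = minute
    return dropped


# Hypothetical fleet: 10 instances at 100 req/min each, +2 instances per step.
fleet = dict(instances=10, per_instance=100, add_per_step=2, cooldown=1)

print(dropped_requests(gradual_ramp(1000, 2800, 10), **fleet))
# 0
print(dropped_requests(sudden_spike(1000, 2800, 10, spike_at=1), **fleet))
# 7200
```

In this toy model the same scaling rule passes a gradual-ramp test with no dropped requests yet drops thousands when the load arrives all at once, which is the kind of blind spot a spike-shaped load test is meant to expose.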

Another improvement that we are investing in is the implementation of kill switches for parts of the client application so that when failures happen, we can keep unaffected parts of our applications working while we work to address the failures.
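As a rough illustration of the kill-switch idea, the sketch below renders a page from independent sections and skips any section whose switch is off. The `KillSwitches` class, the `render_home` function, and the section names are hypothetical. A production system would more likely read flags from a remote configuration service than from an in-memory set.

```python
class KillSwitches:
    """In-memory registry of feature kill switches (hypothetical sketch)."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def is_enabled(self, feature):
        return feature not in self._disabled


def render_home(switches, fetch_prices, fetch_portfolio):
    """Build a page from independent sections, skipping disabled ones."""
    sections = {"prices": fetch_prices, "portfolio": fetch_portfolio}
    page = {}
    for name, fetch in sections.items():
        if switches.is_enabled(name):
            page[name] = fetch()
        else:
            # Degrade gracefully instead of calling a failing backend.
            page[name] = "temporarily unavailable"
    return page


def failing_prices():
    # Stand-in for a backend that is currently erroring.
    raise RuntimeError("price service is overloaded")


switches = KillSwitches()
switches.disable("prices")  # operator flips the switch during the incident

print(render_home(switches, failing_prices, lambda: {"BTC": 1}))
# {'prices': 'temporarily unavailable', 'portfolio': {'BTC': 1}}
```

Because the disabled section is never called, the failing backend cannot take down the rest of the page, and it also stops receiving traffic while it recovers.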

We take the uptime and performance of our infrastructure very seriously, and we’re working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency. If you’re interested in solving scaling challenges like those presented here, come work with us.

Incident Post Mortem: May 19, 2021 was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
