0%
#Type of settings

Incident Post Mortem: January 29, 2021

2021년 2월 5일 6 분 읽기
뉴스 기사 배너 이미지

The Outage

A group of oncall engineers convened after being paged for high error rates across many endpoints. It was quickly diagnosed as an issue with a Redis cluster used to store spot rates for currency conversions. These spot rates are used for estimated conversions between currencies, which puts it in the critical path for a range of functionality such as displaying the value of your portfolio in your local currency. The CPU load on this Redis cluster suddenly hit 100% and was causing several API endpoints and background jobs to fail. We immediately moved to reduce load on this cluster so that it could recover. Our monitoring indicated that the majority of the load on this cluster was coming from background jobs, so we began disabling them, expecting to see query throughput drop.

After several cycles of disabling jobs, load on the cluster was still high, so a cluster failover was attempted to see if the failure was due to a hardware issue. This, too, had no effect. We were also unable to deploy a code change intended to mitigate the issue because the unhealthy Redis cluster was preventing new application servers from booting. An attempt to vertically scale the cluster in place also failed to reduce CPU pressure.

We finally opted to provision a new, much larger Redis cluster to replace the overloaded one. Once it booted, we pointed a fresh application deployment at it, and finally restored service to our customers. However, several questions remain unanswered at this point, prompting us to dig in and uncover the actual root cause.

Root Cause Analysis

In the days since the outage, we have reconstructed a clear picture of what happened during the first minute.

  1. 00:00s — A Memcached node reboots. This Memcached deployment provides a read-through cache in front of the keys that store exchange rates in the Redis cluster that failed. The reboot corresponded with a brief spike in cache misses while the clients shifted their queries to another Memcached cluster member. This is not an unprecedented occurrence, and in this case the Memcached clients recovered quickly as expected.

Memcached client errors

2. 00:02s — Concurrent Redis queries spike and connection pools fill up. The simultaneous cache misses due to the Memcached node restart caused a stampede of queries for the underlying data stored in the Redis cluster. The increased volume of concurrent queries required a 50x increase in the number of Redis connections to serve them.

Redis connection count

3. 00:07s — We see a high rate of Redis client timeout errors. The sudden influx of connections created a slow-down in the Redis engine. CPU utilization jumped to 100% and it stopped being able to respond to queries within a 1 second timeout. At this point, the error rate on our price data API endpoints was around 90%. These timeouts prevented memcached from replenishing its cache, so load on this Redis cluster remained high from this point on.

Redis client timeout errors

4. 00:13s — The 5 second cache TTL at the CDN for all public price data expired. Normally, load on the application for these endpoints is stable, thanks to the CDN cache. During this time, it spiked to 5x normal volume because the CDN began forwarding requests to the app to replenish its cache for all asset price data. Due to the 90% error rate on these endpoints, it was unable to successfully refresh the cache for the vast majority of keys. The application code fell through to Redis to load this data because of the Memcached misses, which further increased load on the already overloaded cluster.

CDN response codes

At this point, the Redis cluster was receiving a throughput of queries many times its normal load, and the layers of cache above it (Memcached and the CDN) were unable to replenish because of Redis client timeouts and connection failures. Any requests which relied on exchange rate data were failing, and users were therefore unable to access critical functionality on our site. The cycle of cache misses and client timeouts kept this Redis cluster overloaded for the duration of the outage despite several attempts to reduce load.

Looking Ahead

With the clarity of hindsight, we’ve identified some blind spots in our monitoring that prevented us from quickly understanding the failure.

  • Our client-side Redis monitoring did not include timed out queries. This made it impossible for us to see that the application was overloading Redis, since we were only looking at successful calls, which decreased during the outage.

  • Our server-side Redis monitoring did not include failed calls. This made for a confusing dashboard that showed a sharp increase in network activity, but no apparent query activity. Our metrics are derived from the Redis INFO commandstats output, and did not include failed_calls or rejected_calls.

  • The Memcached node restart was only discovered after the incident because this information was not surfaced on any of our dashboards.

In addition, our application was configured in a way that allowed an initial small failure to cascade into a full-blown outage.

  • Client connection pools grew to 50x size almost instantly, overwhelming Redis with new connections. To counter this, we’ve tuned connection pool sizes to have more generous minimums and sane maximums.

  • The Redis cluster was underprovisioned. We’ve replaced it with an appropriately sized cluster and have audited our remaining clusters to ensure they are correctly scaled.

  • We were not making use of read-only replicas on this cluster. We are exploring wider use of Redis secondary reads to better scale this and similar use cases.

Another important observation is that an outage in one area of our API was able to interrupt service across all of it. We are currently working to decompose our monolithic application server into separate services in order to prevent these types of outages going forward. Pulling this part of the API into a separately scalable unit will let us isolate it from issues elsewhere, and prevent it from disrupting other features.

We take the uptime and performance of our infrastructure very seriously, and we’re working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency. If you’re interested in solving scaling challenges like those presented here, come work with us .

was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

인기 뉴스

How to Set Up and Use Trust Wallet for Binance Smart Chain
#Bitcoin#Bitcoins#Config+2 더 많은 태그

How to Set Up and Use Trust Wallet for Binance Smart Chain

Your Essential Guide To Binance Leveraged Tokens

Your Essential Guide To Binance Leveraged Tokens

How to Sell Your Bitcoin Into Cash on Binance (2021 Update)
#Subscriptions

How to Sell Your Bitcoin Into Cash on Binance (2021 Update)

What is Grid Trading? (A Crypto-Futures Guide)

What is Grid Trading? (A Crypto-Futures Guide)

Cryptohopper에서 무료로 거래를 시작하세요!

무료 사용 - 신용카드 필요 없음

시작하기
Cryptohopper appCryptohopper app

면책 조항: Cryptohopper는 규제 기관이 아닙니다. 암호화폐 봇 거래에는 상당한 위험이 수반되며 과거 실적이 미래 결과를 보장하지 않습니다. 제품 스크린샷에 표시된 수익은 설명용이며 과장된 것일 수 있습니다. 봇 거래는 충분한 지식이 있거나 자격을 갖춘 재무 고문의 조언을 구한 경우에만 참여하세요. Cryptohopper는 어떠한 경우에도 (a) 당사 소프트웨어와 관련된 거래로 인해, 그로 인해 또는 이와 관련하여 발생하는 손실 또는 손해의 전부 또는 일부 또는 (b) 직접, 간접, 특별, 결과적 또는 부수적 손해에 대해 개인 또는 단체에 대한 어떠한 책임도 지지 않습니다. Cryptohopper 소셜 트레이딩 플랫폼에서 제공되는 콘텐츠는 Cryptohopper 커뮤니티 회원이 생성한 것이며 Cryptohopper 또는 그것을 대신한 조언이나 추천으로 구성되지 않는다는 점에 유의하시기 바랍니다. 마켓플레이스에 표시된 수익은 향후 결과를 나타내지 않습니다. Cryptohopper의 서비스를 사용함으로써 귀하는 암호화폐 거래와 관련된 내재적 위험을 인정하고 수락하며 발생하는 모든 책임이나 손실로부터 Cryptohopper를 면책하는 데 동의합니다. 당사의 소프트웨어를 사용하거나 거래 활동에 참여하기 전에 당사의 서비스 약관 및 위험 공개 정책을 검토하고 이해하는 것이 필수적입니다. 특정 상황에 따른 맞춤형 조언은 법률 및 재무 전문가와 상담하시기 바랍니다.

©2017 - 2024 저작권: Cryptohopper™ - 판권 소유.