0%

Incident Post Mortem: November 23, 2021

2021년 12월 21일 5 분 읽기
뉴스 기사 배너 이미지

The Incident

On November 23rd, 2021, at 4:00pm PT (Nov 24, 2021 00:00 UTC) an SSL certificate for an internal hostname in one of our Amazon Web Services (AWS) accounts expired. The expired SSL certificate was used by many of our internal load balancers which caused a majority of inter-service communications to fail. Due to the fact that our API routing layer connects to backend services via subdomains of this internal hostname, about 90% of incoming API traffic returned errors.

Error rates returned to normal once we were able to migrate all load balancers to a valid certificate.

Chart depicting overall 90% error rate at our API routing layer for duration of incident.

Context: Certificates at Coinbase

It’s helpful to provide some background information about how we manage SSL certificates at Coinbase. For the most part, certificates for public hostnames like coinbase.com are managed and provisioned by Cloudflare. For certificates for internal hostnames used to route traffic between backend services, we historically leveraged AWS IAM Server Certificates.

One of the downsides of IAM Server Certificates is that certificates must be generated outside of AWS and uploaded via an API call. So last year, our infrastructure team migrated from IAM Server Certificates to AWS Certificate Manager (ACM). ACM solves the security problem because AWS generates both the public and private components of the certificate within ACM and stores the encrypted version in IAM for us. Only connected services like Cloudfront and Elastic Load Balancers will get access to the certificates. Denying the acm:ExportCertificate permission to all AWS IAM Roles ensures that they can’t be exported.

In addition to the added security benefits, ACM also automatically renews certificates before expiration. Given that ACM certificates are supposed to renew and we did a migration, how did this happen?

Root Cause Analysis

Incident responders quickly noticed that the expired certificate was an IAM Server Certificate. This was unexpected because the aforementioned ACM migration had been widely publicized in engineering communication channels at the time; thus we had been operating under the assumption that we were running exclusively on ACM certificates.

As we later discovered, one of the certificate migrations didn’t go as planned; the group of engineers working on the migration uploaded a new IAM certificate and postponed the rest of the migration. Unfortunately, the delay was not as widely communicated as it should have been and changes to team structure and personnel resulted in the project being incorrectly assumed complete.

Migration status aside, you may ask the same question we asked ourselves: “Why weren’t we alerted to this expiring certificate?” The answer is: we were. Alerts were being sent to an email distribution group that we discovered only consisted of two individuals. This group was originally larger, but shrank with the departure of team members and was never sufficiently repopulated as new folks joined the team.

In short, the critical certificate was allowed to expire due all of three factors:

  1. The IAM to ACM migration was incomplete.

  2. Expiration alerts were only being sent via email and were filtered or ignored.

  3. Only two individuals were on the email distribution list.

Resolution & Improvements

In order to resolve the incident we migrated all of the load balancers that were using the expired IAM cert to the existing auto-renewing ACM cert that had been provisioned as part of the original migration plan. This took longer than desired due to the number of load balancers involved and our cautiousness in defining, testing, and applying the required infrastructure changes.

In order to ensure we don’t run into an issue like this again, we’ve taken the following steps to address the factors mentioned in the RCA section above:

  1. We’ve completed the migration to ACM, are no longer using IAM Server Certificates and are deleting any legacy certificates to reduce noise.

  2. We’re adding automated monitoring that is connected to our alerting and paging system to augment the email alerts. These will page on impending expiration as well as when ACM certificates drop out of auto-renewal eligibility.

  3. We’ve added a permanent group-alias to the email distribution list. Furthermore, this group is automatically updated as employees join and leave the company.

  4. We’re building a repository of incident remediation operations in order to reduce time to define, test and apply new changes.

We take the uptime and performance of our infrastructure very seriously, and we’re working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency. If you’re interested in solving challenges like those listed here, come work with us.

was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

인기 뉴스

How to Set Up and Use Trust Wallet for Binance Smart Chain
#Bitcoin#Bitcoins#Config+2 더 많은 태그

How to Set Up and Use Trust Wallet for Binance Smart Chain

Your Essential Guide To Binance Leveraged Tokens

Your Essential Guide To Binance Leveraged Tokens

How to Sell Your Bitcoin Into Cash on Binance (2021 Update)
#Subscriptions

How to Sell Your Bitcoin Into Cash on Binance (2021 Update)

What is Grid Trading? (A Crypto-Futures Guide)

What is Grid Trading? (A Crypto-Futures Guide)

Cryptohopper에서 무료로 거래를 시작하세요!

무료 사용 - 신용카드 필요 없음

시작하기
Cryptohopper appCryptohopper app

면책 조항: Cryptohopper는 규제 기관이 아닙니다. 암호화폐 봇 거래에는 상당한 위험이 수반되며 과거 실적이 미래 결과를 보장하지 않습니다. 제품 스크린샷에 표시된 수익은 설명용이며 과장된 것일 수 있습니다. 봇 거래는 충분한 지식이 있거나 자격을 갖춘 재무 고문의 조언을 구한 경우에만 참여하세요. Cryptohopper는 어떠한 경우에도 (a) 당사 소프트웨어와 관련된 거래로 인해, 그로 인해 또는 이와 관련하여 발생하는 손실 또는 손해의 전부 또는 일부 또는 (b) 직접, 간접, 특별, 결과적 또는 부수적 손해에 대해 개인 또는 단체에 대한 어떠한 책임도 지지 않습니다. Cryptohopper 소셜 트레이딩 플랫폼에서 제공되는 콘텐츠는 Cryptohopper 커뮤니티 회원이 생성한 것이며 Cryptohopper 또는 그것을 대신한 조언이나 추천으로 구성되지 않는다는 점에 유의하시기 바랍니다. 마켓플레이스에 표시된 수익은 향후 결과를 나타내지 않습니다. Cryptohopper의 서비스를 사용함으로써 귀하는 암호화폐 거래와 관련된 내재적 위험을 인정하고 수락하며 발생하는 모든 책임이나 손실로부터 Cryptohopper를 면책하는 데 동의합니다. 당사의 소프트웨어를 사용하거나 거래 활동에 참여하기 전에 당사의 서비스 약관 및 위험 공개 정책을 검토하고 이해하는 것이 필수적입니다. 특정 상황에 따른 맞춤형 조언은 법률 및 재무 전문가와 상담하시기 바랍니다.

©2017 - 2025 저작권: Cryptohopper™ - 판권 소유.