0%

Data semantics: the missing layer of your data warehouse

2020년 8월 19일 6 분 읽기
뉴스 기사 배너 이미지

DRY & SQL

Often our first step in deriving value from data is to extract and summarize records within our Data Warehouse. Here the lingua franca is SQL. As a declarative language, SQL provides an extremely valuable layer of abstraction allowing us to describe what we want computed without knowing on which exact machine the data is stored nor thinking at all about how to orchestrate this computation. Let’s say we want to compute the number of BTC trades on Coinbase for a given asset month by month:

SELECT

COUNT(*)

, DATE_TRUNC(‘month’, timestamp)

FROM facts.trades

WHERE asset = ‘BTC’

GROUP BY 2

But wait, tomorrow I might need to investigate volume and this time on ETH:

SELECT

SUM(amount) AS volume

, DATE_TRUNC(‘month’, timestamp)

FROM facts.trades

WHERE asset = ‘ETH’

GROUP BY 2

Now we didn’t technically violate DRY, but let’s be honest, we did. The vast majority of the code written to answer these two questions is identical. In particular, we’ve duplicated our knowledge of the storage location and aggregate/filtering fields. And here we get to the core of the problem: Even in a well structured and well documented Data Warehouse, raw SQL queries force us to rewrite the underlying logic of our questions from scratch every time, producing heartbreaking inefficiency, inconsistent answers, and Data Scientists who spend more of their time as Data Librarians.

In this world, our Data Scientists are an embedded part of the machinery that enables us to derive value from our data and without them, we are lost. It looks something like this:

Data Semantics: a level above SQL

Now let’s chart a path out of this inefficiency & endless DRY-violation. At some point we need to write SQL to talk to our Data Warehouse, but this time let’s write it in a way so that it can answer our questions over and over again, even when they change a little bit. If we can manage that we’ll have made a quantum leap forward, making SQL the new bytecode. But we’re getting ahead of ourselves.

To get us started, the concepts we’ll need are:

  • Measures: aggregatable data records representing the measurement of some quantity, often we’ll sum, average or count the measures contained within individual records to investigate these quantities in aggregate. In the above query we summed up the “amount” measure to get the trading volume

  • Dimensions: non-aggregatable data records that we may want to use to filter or partition our analysis, giving our analysis the appropriate context. In the above query we filtered by the dimension “asset”

By adding this semantic metadata (measures & dimensions) to the underlying data powering our queries, we can codify which questions this data can answer and enable these questions to automatically be compiled into SQL (again, SQL as bytecode). Here’s an example of how that might look in YAML format.

compute: sqlstorage: facts.trades

measures:

— name: trade_volume

kind: sum

— name: trades

kind: count

dimensions:

— name: asset

kind: string

— name: timestamp

kind: time

Now instead of writing the full query I have a new input

measures: trades

dimensions: month

filters: [asset=BTC]

or

measures: trade_volume

dimensions: month

filters: [asset=ETH]

Admittedly, this looks a lot like a bastardized SQL dialect, and that’s not wrong. But we’ve abstracted away the important parts, namely the aggregation logic (count & sum for trades & volume, respectively) and the storage (facts.trades) so that I’m truly just expressing my core question in a structured way. At Coinbase, we’ve built a centralized API that understands these configurations and orchestrates any requested computation. This has paid huge dividends including:

  • Separation of Storage: while we’ve focused on SQL, it really doesn’t matter where this data is stored, so long as the records constitute the promised measures and dimensions

  • Separation of Compute: SQL can easily do counts, sums, etc. But SQL has no monopoly on these computations and other engines can be used as needed.

  • Single Source of Truth: with a centralized orchestrator managing this metadata we have a single source of truth for measure and dimension definitions enabling us to ensure consistent answers to our most important questions.

In this way we’ve changed the model of how we derive value from data, taking Data Scientists out of the machinery and instead enabling them to configure what the data they manage means:

For the sake of brevity, I’m leaving out a ton of the details here but the core concept is what we hope to get across. By codifying the semantic metadata that tells us how to use our various data sources, we can move away from the nuts and bolts process of manually specifying where data lives or how we want it aggregated. Instead, we can start asking our questions at a higher level (e.g. “BTC trades by month”) and let the semantic orchestration layer figure out where such data lives, and which engines should be spun up to aggregate it according to pre-configured methodology.

At Coinbase we have used these concepts to provide a great deal of automation to common Data Science tasks including Executive Reporting, Experimentation, Anomaly Detection, and Root Cause Analysis, all of which build upon the same idea of pre-configuring what the data means and relying on the orchestration layer to interpret this metadata to roll-up any desired quantities. By changing the Data Science workflow from an endless stream of DRY-violations to a coordinated, codified and centralized effort at maintaining a semantic interface, we are moving beyond ad-hoc SQL requests and towards a self-serve, single source of truth view of our business powering an ever expanding set of reliable answers to our most pressing questions.

If you are interested in solving complex technical challenges like this, Coinbase is hiring.

This website contains links to third-party websites or other content for information purposes only (“Third-Party Sites”). The Third-Party Sites are not under the control of Coinbase, Inc., and its affiliates (“Coinbase”), and Coinbase is not responsible for the content of any Third-Party Site, including without limitation any link contained in a Third-Party Site, or any changes or updates to a Third-Party Site. Coinbase is not responsible for webcasting or any other form of transmission received from any Third-Party Site. Coinbase is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement, approval or recommendation by Coinbase of the site or any association with its operators.

Unless otherwise noted, all images provided herein are by Coinbase.

was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

인기 뉴스

How to Set Up and Use Trust Wallet for Binance Smart Chain
#Bitcoin#Bitcoins#Config+2 더 많은 태그

How to Set Up and Use Trust Wallet for Binance Smart Chain

Your Essential Guide To Binance Leveraged Tokens

Your Essential Guide To Binance Leveraged Tokens

How to Sell Your Bitcoin Into Cash on Binance (2021 Update)
#Subscriptions

How to Sell Your Bitcoin Into Cash on Binance (2021 Update)

What is Grid Trading? (A Crypto-Futures Guide)

What is Grid Trading? (A Crypto-Futures Guide)

Cryptohopper에서 무료로 거래를 시작하세요!

무료 사용 - 신용카드 필요 없음

시작하기
Cryptohopper appCryptohopper app

면책 조항: Cryptohopper는 규제 기관이 아닙니다. 암호화폐 봇 거래에는 상당한 위험이 수반되며 과거 실적이 미래 결과를 보장하지 않습니다. 제품 스크린샷에 표시된 수익은 설명용이며 과장된 것일 수 있습니다. 봇 거래는 충분한 지식이 있거나 자격을 갖춘 재무 고문의 조언을 구한 경우에만 참여하세요. Cryptohopper는 어떠한 경우에도 (a) 당사 소프트웨어와 관련된 거래로 인해, 그로 인해 또는 이와 관련하여 발생하는 손실 또는 손해의 전부 또는 일부 또는 (b) 직접, 간접, 특별, 결과적 또는 부수적 손해에 대해 개인 또는 단체에 대한 어떠한 책임도 지지 않습니다. Cryptohopper 소셜 트레이딩 플랫폼에서 제공되는 콘텐츠는 Cryptohopper 커뮤니티 회원이 생성한 것이며 Cryptohopper 또는 그것을 대신한 조언이나 추천으로 구성되지 않는다는 점에 유의하시기 바랍니다. 마켓플레이스에 표시된 수익은 향후 결과를 나타내지 않습니다. Cryptohopper의 서비스를 사용함으로써 귀하는 암호화폐 거래와 관련된 내재적 위험을 인정하고 수락하며 발생하는 모든 책임이나 손실로부터 Cryptohopper를 면책하는 데 동의합니다. 당사의 소프트웨어를 사용하거나 거래 활동에 참여하기 전에 당사의 서비스 약관 및 위험 공개 정책을 검토하고 이해하는 것이 필수적입니다. 특정 상황에 따른 맞춤형 조언은 법률 및 재무 전문가와 상담하시기 바랍니다.

©2017 - 2024 저작권: Cryptohopper™ - 판권 소유.