
Tech Notes

The Accidental Medallion Architecture

Koert Kuipers

Feb 8, 2023

our first (or maybe it was second…) client way back in 2011 wanted to create a customer 360 (total view of customer) dataset by combining many disparate datasets. the datasets had been acquired from different 3rd party vendors. they were very large datasets (think usa population size), had different schemas, different data formatting conventions, and there were no reliable common identifiers across datasets. altogether it was close to a billion records.

we quickly realized this was going to be a major data integration challenge. at this point our software could at best be described as a glorified distributed etl engine (powered by hadoop map-reduce) with a few fancy tricks for semantic data integration (including a basic locality-sensitive hashing based approach to record linkage, because we knew this was going to be a major issue). so it wasn’t exactly a shrink-wrapped software suite for the problem at hand, but we could scale up (at that time this was a selling point by itself) and we had a basic but working approach to the core issue that needed to be solved. so good enough!

we ended up with an approach to this challenge that could roughly be divided into 5 stages:

  1. collect the data in a staging area on distributed storage. think of this as our data landing zone. we called this staging.
  2. profile the data to discover any major data issues (token values, inconsistent formatting, missing data, etc.).
  3. clean the data and map it to a single unified (sparse) schema. the output of this step we called prep (as in: the data has been prepped)
  4. run any joins and/or record linkage to properly integrate the data at a semantic level and enrich it as needed. we called the output of this… well output.
  5. run aggregations on the cleaned, standardized and linked data to create easily digestible output for further consumption. we called this master as in “master data” but today given the negative connotations associated with the word master perhaps we would call it da as in “data asset”.
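the 5 stages above can be sketched in miniature. this is just an illustrative toy with made-up records and hypothetical names, where a naive blocking key stands in for the locality-sensitive hashing; the real pipeline ran on hadoop map-reduce over distributed storage:

```python
# stage 1: staging -- collect raw vendor records as-is (hypothetical data)
staging = [
    {"NAME": "J. Smith", "DOB": "05/01/1980"},          # vendor a's schema
    {"full_name": "John Smith", "birth": "1980-05-01"}, # vendor b's schema
]

# stage 2: profile -- discover schemas, formatting conventions, missing data
profile = {
    "schemas": {tuple(sorted(r)) for r in staging},
    "missing": sum(1 for r in staging if not all(r.values())),
}

# stage 3: prep -- clean and map to a single unified (sparse) schema
def to_unified(record):
    name = record.get("NAME") or record.get("full_name")
    dob = record.get("DOB") or record.get("birth")
    if dob and "/" in dob:  # normalize mm/dd/yyyy to iso format
        m, d, y = dob.split("/")
        dob = f"{y}-{m.zfill(2)}-{d.zfill(2)}"
    return {"name": name, "dob": dob}

prep = [to_unified(r) for r in staging]

# stage 4: output -- link records at a semantic level; a naive blocking key
# (last name + dob) stands in here for locality-sensitive hashing
def link_key(record):
    return (record["name"].split()[-1].lower(), record["dob"])

linked = {}
for r in prep:
    linked.setdefault(link_key(r), []).append(r)

# stage 5: master -- aggregate linked records into consumable output
master = [
    {"entity": key, "sources": len(group)}
    for key, group in linked.items()
]
```

here the two vendor records end up linked into one entity with two sources, which is the whole point of the exercise.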

since this first client we have refined this approach, made it more automated, more powerful, and turned it into proper software that encapsulates all the stages, but the essential data flow has remained mostly unchanged and has proven to be extremely resilient and adaptable. we use it today for widely varying use cases (customer 360, anti-money laundering and fraud are perhaps no surprise, but it also works for trade reconciliation and tracking customer orders).

recently i was browsing the internet a little, reading about medallion architecture (see here or here). this certainly wasn’t the first time i’d heard of it, but looking at it again it struck me how similar it is to our approach.

the medallion architecture is a data design pattern where the data goes through 3 phases (bronze, silver, gold) and the idea is that the quality of the data improves with each step. you can read the details in the links provided above, but in a nutshell the medallion architecture consists of these phases:

  • bronze: raw data
  • silver: cleansed and conformed data
  • gold: consumption ready data

so let’s go through our process and map it to the medallion phases.

our step 1 clearly maps to bronze. basically, what medallion calls bronze we call staging. yup, it is the same thing.

now i think it’s pretty clear that our step 3 (clean the data and map it to a unified schema) belongs in silver. both mention data cleaning, and unified schema sounds the same as conformed data. and since we consider step 2 a prerequisite for step 3 (without the profiling information, cleaning is going to be very difficult), our step 2 belongs here as well.

but what about joins and record linkage (our step 4)? the idea of silver in medallion is that the data has been processed just enough to make it usable, with speed and agility prioritized over, say, project-specific transformations. one could argue that this means our step 4 does not belong in silver, but i would disagree. data that has not been properly integrated (at a semantic level, so using the joins and record linkage) might seem useful but it’s not. so i think our step 4 squarely belongs in silver.

finally, our step 5, which runs aggregations and produces easily consumable data, seems a nice fit for gold.
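the whole mapping can be summed up in a few (hypothetical) lines — the stage names are ours, the layers are medallion’s:

```python
# our 5 stages mapped onto the 3 medallion layers
mapping = {
    "staging (step 1)": "bronze",  # raw data in the landing zone
    "profile (step 2)": "silver",  # prerequisite for cleaning
    "prep (step 3)":    "silver",  # cleansed, unified schema
    "output (step 4)":  "silver",  # joined and record-linked
    "master (step 5)":  "gold",    # aggregated, consumption ready
}
```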

done!

tresata delivers your medallion i guess!?