Koert Kuipers
Feb 8, 2023
our first (or maybe it was second…) client way back in 2011 wanted to create a customer 360 (a total view of the customer) dataset by combining many disparate datasets. the datasets had been acquired from different 3rd party vendors. they were very large datasets (think usa population size), had different schemas, different data formatting conventions, and there were no reliable common identifiers across datasets. altogether it was close to a billion records.
we quickly realized this was going to be a major data integration challenge. at this point our software could at best be described as a glorified distributed etl engine (powered by hadoop map-reduce) with a few fancy tricks for semantic data integration (including a basic locality-sensitive hashing based approach to record linkage, because we knew this was going to be a major issue). so it wasn’t exactly a shrink-wrapped software suite for the problem at hand, but we could scale up (at that time this was a selling point by itself) and we had a basic but working approach to the core issue that needed to be solved. so good enough!
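just to illustrate the general idea (this is not our actual implementation): an lsh-style blocking step hashes each record into buckets so that similar records are likely to collide, and then only records that share a bucket get compared in detail. here is a toy minhash version in plain python; the field names and parameters are made up, and the real thing of course ran as distributed hadoop map-reduce jobs, not like this.

```python
# toy sketch of lsh-style blocking for record linkage via minhash + banding.
# field names, hash counts and band sizes are made up for illustration only.
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text: str, k: int = 3) -> set:
    """character k-grams of a whitespace-stripped, lowercased string."""
    s = "".join(text.lower().split())
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}

def minhash(shingle_set: set, num_hashes: int) -> list:
    """minhash signature: per seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{sh}".encode()).digest()[:8], "big")
            for sh in shingle_set)
        for seed in range(num_hashes)
    ]

def candidate_pairs(records, num_hashes: int = 32, bands: int = 16):
    """band the signatures; records that collide in any band become candidate pairs."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        key = f"{rec['name']} {rec['address']}"          # hypothetical fields
        sig = minhash(shingles(key), num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs   # only these pairs go on to the expensive detailed comparison

records = [
    {"name": "John A. Smith", "address": "12 Main St, Springfield"},
    {"name": "Jon Smith", "address": "12 Main Street, Springfield"},
    {"name": "Mary Jones", "address": "99 Oak Ave, Shelbyville"},
]
print(candidate_pairs(records))   # the two smith records will very likely collide in some band
```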
we ended up with an approach to this challenge that could roughly be divided into 5 stages:

1. land every source dataset as-is; we call this output staging
2. profile the staged data
3. clean the data and map it to a unified schema; we call this output .prep (as in: the data has been prepped)
4. perform joins and record linkage across the datasets
5. run aggregations and produce easily consumable data; we call this output .master (as in “master data”, but today, given the negative connotations associated with the word master, perhaps we would call it da, as in “data asset”)

since this first client we have refined this approach, made it more automated, more powerful, and turned it into proper software that encapsulates all the stages, but the essential data flow has remained mostly unchanged and has proven to be extremely resilient and adaptable. we use it today for widely varying use cases (customer 360, anti-money laundering and fraud are perhaps no surprise, but it also works for trade reconciliation and tracking customer orders).
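to make the data flow concrete, here is a toy sketch of the 5 stages strung together in plain python. every function body is a simplified placeholder (the real stages are proper distributed jobs), but the shape of the flow is the point: staging feeds profiling, profiling informs prep, prep feeds the linking, and linking feeds the master output.

```python
# toy sketch of the 5-stage flow: staging -> profiling -> prep -> linking -> master.
# all field names and function bodies are illustrative placeholders only.
from collections import Counter

def stage(raw_sources):
    """stage 1: land every source dataset as-is, only tagged with its origin."""
    return [dict(rec, _source=name) for name, recs in raw_sources.items() for rec in recs]

def profile(staged):
    """stage 2: profile the staged data; here just count empty values per field."""
    empties = Counter()
    for rec in staged:
        empties.update(k for k, v in rec.items() if v in (None, ""))
    return empties

def prep(staged):
    """stage 3: clean the data and map it to a unified schema (the '.prep' output)."""
    return [{
        "name": (rec.get("name") or rec.get("full_name") or "").strip().lower(),
        "zip": (rec.get("zip") or rec.get("postal_code") or "").strip(),
        "_source": rec["_source"],
    } for rec in staged]

def link(prepped):
    """stage 4: record linkage; a naive blocking key stands in for the real lsh approach."""
    for rec in prepped:
        rec["entity_id"] = f"{rec['name'].split()[-1]}|{rec['zip']}" if rec["name"] else None
    return prepped

def master(linked):
    """stage 5: aggregate per linked entity into consumption-ready '.master' records."""
    out = {}
    for rec in linked:
        ent = out.setdefault(rec["entity_id"], {"records": 0, "sources": set()})
        ent["records"] += 1
        ent["sources"].add(rec["_source"])
    return out

raw = {
    "vendor_a": [{"name": "Jane Doe", "zip": "10001"}, {"name": "Bob Roe", "zip": ""}],
    "vendor_b": [{"full_name": "jane doe ", "postal_code": "10001"}],
}
staged = stage(raw)
print(profile(staged))            # one empty zip: that informs the cleaning rules in prep()
print(master(link(prep(staged)))) # both jane doe records roll up into a single entity
```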
recently i was browsing the internet a little, reading about medallion architecture (see here or here). this certainly wasn’t the 1st time i heard of it, but looking at it again it struck me how similar it is to our approach.
the medallion architecture is a data design pattern where the data goes through 3 phases (bronze, silver, gold), and the idea is that the quality of the data improves with each step. you can read the details in the links provided above, but in a nutshell the medallion architecture consists of these phases:
bronze: raw data
silver: cleansed and conformed data
gold: consumption ready data

so let’s go through our process and map it to the medallion phases.
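but first, to make the three phases a little more tangible, here is one made-up record as it could look in each phase. the field names and values are invented, purely to illustrate the “quality improves with each step” idea.

```python
# one made-up record as it might look in each medallion phase

# bronze: raw data, exactly as the vendor delivered it
bronze = {"FULL_NAME": "DOE, JANE ", "ZIP_CD": "10001-0000", "LAST_TXN": "02/08/2023"}

# silver: cleansed and conformed to a shared schema
silver = {"name": "jane doe", "zip": "10001", "last_txn": "2023-02-08"}

# gold: consumption ready, aggregated over all records linked to this customer
gold = {"entity_id": "cust-000017", "name": "jane doe", "total_txns": 42,
        "sources": ["vendor_a", "vendor_b"]}

print(bronze, silver, gold, sep="\n")
```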
our step 1 clearly maps to bronze. basically we call bronze staging. yup it is the same thing.
now i think it’s pretty clear that our step 3 (clean the data and map it to a unified schema) belongs in silver. both mention data cleaning, and unified schema sounds the same as conformed data. and since we consider step 2 a prerequisite for step 3 (without the profiling information, cleaning is going to be very difficult), our step 2 also belongs here.
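a tiny made-up example of why profiling comes first: you look at what actually occurs in a field before you write the cleaning rule for it. the date field and formats below are invented.

```python
# toy example: profile a date field first, then use what you learned to clean and conform it.
from collections import Counter
from datetime import datetime

raw_dates = ["02/08/2023", "2023-02-08", "8 feb 2023", "02/08/2023", "n/a"]   # made-up vendor data

# step 2 (profiling): which formats actually occur in this field?
def classify(value):
    for fmt, label in [("%m/%d/%Y", "mm/dd/yyyy"), ("%Y-%m-%d", "iso"), ("%d %b %Y", "d mon yyyy")]:
        try:
            datetime.strptime(value, fmt)
            return label
        except ValueError:
            pass
    return "unparseable"

print(Counter(classify(v) for v in raw_dates))
# Counter({'mm/dd/yyyy': 2, 'iso': 1, 'd mon yyyy': 1, 'unparseable': 1})

# step 3 (cleaning): the profile tells us exactly which formats the conforming rule must handle
def conform(value):
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d %b %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    return None   # the profile showed some values are junk; map those to null

print([conform(v) for v in raw_dates])
# ['2023-02-08', '2023-02-08', '2023-02-08', '2023-02-08', None]
```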
but what about joins and record linkage (our step 4)? the idea of silver in medallion is that the data has been processed just enough to make it usable, with speed and agility prioritized over, say, project-specific transformations. one could argue that this means our step 4 does not belong in silver, but i would disagree. data that has not been properly integrated (at a semantic level, so using the joins and record linkage) might seem useful, but it’s not. so i think our step 4 squarely belongs in silver.
finally, our step 5 runs aggregations and produces easily consumable data. this seems a nice fit for gold.
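and to round it off, a toy version of the kind of aggregation step 5 runs: roll everything linked to an entity up into one consumption-ready row (the fields and numbers are made up).

```python
# toy gold-layer aggregation: one consumption-ready row per linked entity (fields are made up).
from collections import defaultdict

linked = [   # output of the linkage step: every record carries an entity_id
    {"entity_id": "cust-1", "amount": 120.0, "_source": "vendor_a"},
    {"entity_id": "cust-1", "amount": 80.0, "_source": "vendor_b"},
    {"entity_id": "cust-2", "amount": 15.5, "_source": "vendor_a"},
]

gold = defaultdict(lambda: {"total_amount": 0.0, "records": 0, "sources": set()})
for rec in linked:
    row = gold[rec["entity_id"]]
    row["total_amount"] += rec["amount"]
    row["records"] += 1
    row["sources"].add(rec["_source"])

for entity_id, row in gold.items():
    print(entity_id, row)
# cust-1 rolls up to 200.0 across 2 records from both vendors; cust-2 to 15.5 from one record
```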
done!
tresata delivers your medallion i guess!?
Mirror Link – https://technotes.tresata.ai/medallion-architecture/