Jul 26, 2023
pretty much everyone i talk to these days agrees that cleaning, enriching and standardizing data is crucial in any data strategy. it is also expensive, and often unsuccessful.
why is it so hard? one reason is that it’s often tackled one record at a time. a single record without context is very hard to clean! is that corporate address correct? who knows (i guess you can google it)! is that person’s middle name missing, or does that person simply not have one? to clean you need context. good context comes from aggregation (look at all the records you have for this person to decide about that middle name) and from enrichment (bring in a third party data set with high quality addresses to resolve the corporate address question).
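to make the aggregation idea concrete, here is a minimal python sketch. the field names and the “most common non-null value wins” rule are illustrative assumptions, not an actual implementation:

```python
# deciding a person's middle name by aggregating ALL records for that
# person, instead of judging one record in isolation.
from collections import Counter

# hypothetical records already grouped for one person
records = [
    {"person_id": 7, "middle_name": None},     # missing
    {"person_id": 7, "middle_name": "James"},
    {"person_id": 7, "middle_name": "James"},
    {"person_id": 7, "middle_name": None},     # missing
]

# aggregate: take the most common non-null value across the records
names = Counter(r["middle_name"] for r in records if r["middle_name"])
middle_name = names.most_common(1)[0][0] if names else None
print(middle_name)  # -> James
```

no single record could answer the question; the group of records can.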
this sounds good, right? but i glossed over the fact that finding all relevant records for aggregation is a hard problem in itself. how do you know what all the records for this one person are? are you missing some? did you select records for multiple persons instead of one? enrichment suffers from a similar problem (it’s really just a more restrictive form of aggregation): that great third party data set won’t be delivered to you with your internal join keys in it, so integrating it means dealing with complex and ambiguous relations between your internal records and the external ones.
a key piece of technology is missing here, and i believe it’s the knowledge graph (also known as the semantic network). to highlight its relationship to your existing (disparate) tabular data i like to call it the graph overlay. basically the idea is to use a graph to capture all the relationships between records across all internal and external data sets, and then derive context for aggregation and enrichment from that graph.
a graph overlay is critical for data integration, because nobody’s data is perfect or unambiguous and a graph can accurately capture these imperfect or ambiguous relations. and even if the data is perfect, the complexity of the relations within the data is often such that traditional joins (e.g. foreign keys) cannot capture it but a graph can.
creating a knowledge graph is about creating and inferring accurate relationships. factual relations can be extracted with some effort from existing join keys and other known relationships encoded in the tabular data. the harder part is inferred relations (which can be probabilistic, think for example of relations such as same person, family member, etc.). this is hard because it’s a bit of a needle in a haystack problem: finding similar or related entities in a search space with millions or billions of records requires technologies like locality-sensitive hashing. and the problem should be solved without resorting to a brittle rule-based approach. instead the logic should be derived from a simple declarative description of the domain, and the execution automatically generated and optimized.
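as a toy illustration of locality-sensitive hashing, here is a minhash + banding sketch in python. everything here (token sets, hash count, band split) is an invented example to show the shape of the technique, not production matching logic:

```python
# minhash + banding: find candidate "same entity" pairs without
# comparing all n^2 record pairs. records whose signatures collide
# in at least one band become candidates for closer inspection.
import hashlib

def minhash(tokens, num_hashes=32):
    """one min value per seeded hash function -> signature vector."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens
        ))
    return sig

def lsh_buckets(signatures, bands=8):
    """group record ids whose signatures match in any band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for rec_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(rec_id)
    return {k: v for k, v in buckets.items() if len(v) > 1}

# rec1 and rec2 carry identical tokens here, so they are guaranteed to
# collide; near-duplicates collide with a probability tuned by bands/rows
sigs = {
    "rec1": minhash({"john", "smith", "42", "main", "st"}),
    "rec2": minhash({"john", "smith", "42", "main", "st"}),
    "rec3": minhash({"alice", "jones", "9", "oak", "ave"}),
}
candidates = set()
for group in lsh_buckets(sigs).values():
    candidates |= group
```

the point is that candidate generation is sub-quadratic: only records that land in a shared bucket ever get compared in detail.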
ok so now what do you do once you have this knowledge graph built? it provides a beautiful overlay for all your existing data that captures complex but vital relationships. it includes both factual relations and inferred relations. it is full of interesting information! and it tends to be representable as a relatively simple table of relations (graph edges). the columns could be:
table1, pkey1, table2, pkey2, type, confidence
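for illustration, a few hypothetical rows in such an edge table might look like this (all table names, keys, relation types and confidences are invented):

```python
# edges between records across tables: factual edges come from existing
# join keys (confidence 1.0), inferred edges carry a probabilistic
# confidence from the matching logic.
edges = [
    # (table1, pkey1, table2, pkey2, type, confidence)
    ("orders",    "o-1001", "payments",  "p-2001", "payment_for_order",  1.00),
    ("orders",    "o-1001", "shipments", "s-3001", "shipment_for_order", 1.00),
    ("customers", "c-17",   "crm",       "x-88",   "same_person",        0.93),
]
```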
you might think at this point that you are done, right? you have a new table and it is a relatively simple one with just a few columns. now you can just load it into a database and incorporate it in your SQL queries like any other table. unfortunately this is where reality can be rather unpleasant, because relational databases and SQL are rather ill suited to leveraging this kind of data. instead of explaining this myself i will refer you to the excellent articles written by the LinkedIn folks on the same topic: liquid part1 and liquid part2. part 2 explains the challenges of representing the graph in a relational database.
an alternative approach might be to kind of hack the knowledge graph into a SQL query representation without actually hosting the graph in a relational database. let’s call this the “we serve SQL but do something entirely different in the kitchen” approach. we have tried this, but with rather limited success. the main issue is that for this approach to succeed the SQL needs to be valid and it should work with existing BI tools, and if you meet those criteria the result tends to be rather limiting and disappointing. also this approach tends to focus on somewhat basic ad hoc graph traversals, which in our experience does not expose the most interesting structures or relationships in the data.
the approach we chose to use instead is to “flatten” the graph for specific use cases. what i mean by this is that we observe typical patterns of aggregation and turn them into identifiers (keys) that can be inserted back into traditional database tables (or at least trivially joined into them) and be used for filtering. such identifiers could for example represent “records for same person” or “all records related to a single order” (so think order, payment attempts, shipments, customer service interactions etc.).
one can think of this “flattening” as creating identifiers for clusters in the graph that represent useful real world entities. this approach abstracts and hides the complexity of the knowledge graph. since the graph algorithms used to build these clusters tend to be more powerful and can handle more complexity than graph traversals, the results can be high quality. and most importantly we have found these clusters to be much more valuable than traversals, because our interest is in aggregations, not point-to-point paths.
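the simplest possible version of this flattening is connected components over the edge table. here is a minimal union-find sketch (record keys and edges are invented; real systems would apply confidence thresholds and far more sophisticated clustering):

```python
# assign every (table, pkey) record a cluster id by running connected
# components (union-find with path compression) over the edge table.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def cluster_ids(edges):
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union the two components
    # final pass: every record maps to its component's root = cluster id
    return {node: find(parent, node) for node in parent}

# records are (table, pkey) pairs; edges relate them across tables
edges = [
    (("orders", "o-1001"), ("payments", "p-2001")),
    (("orders", "o-1001"), ("shipments", "s-3001")),
    (("customers", "c-17"), ("crm", "x-88")),
]
clusters = cluster_ids(edges)
# the order, its payment and its shipment share one cluster id;
# the customer pair gets a different one
```

the resulting mapping is exactly the kind of simple key column that can be joined back into traditional tables and filtered on.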
the graph algorithms deployed vary from the relatively simple connected components and graph clustering to complex optimal edge cuts, auction algorithms, label propagation, graph embeddings, etc. most of them will be pregel style, but not all. and all these algorithms have to be implemented on graphs that can have billions of nodes and tens of billions of edges. finally, let’s not forget that since our approach materializes new cluster identifiers for all records, the clustering algorithms run across all the data, instead of supporting some kind of on-demand query capability that only accesses a subset of the data as needed.
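as a toy illustration of the pregel style, here is connected components written as repeated message passing: in each superstep every node adopts the smallest label among itself and its neighbors, until nothing changes. this single-machine loop is only a sketch of the idea; the real systems run the same supersteps distributed over billions of edges:

```python
# pregel-style connected components: iterate "adopt the minimum label
# you can see" until a fixed point is reached.
def pregel_cc(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    label = {n: n for n in adj}  # every node starts as its own label
    changed = True
    while changed:               # one loop iteration = one superstep
        changed = False
        for n in adj:
            best = min([label[n]] + [label[m] for m in adj[n]])
            if best < label[n]:
                label[n] = best
                changed = True
    return label

labels = pregel_cc([("a", "b"), ("b", "c"), ("x", "y")])
# a, b, c converge to one label; x, y to another
```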
the result is high value and simple to use. let’s go back to the order example to illustrate. one can start with any record in any of the tables available, use the primary key of that record to look up the cluster id (which in this case represents the order) and then use that cluster id to retrieve all relevant records across all tables for the order. so it’s a few lookups and some filtering, but no joins! it’s super fast and efficient and immediately gives you the entire story (full context across the entire organization) for the order. it’s worth pointing out that the query can start anywhere… it’s not like we need to go find the official record for the order in the orders table. the start could be a record for a customer service interaction that had a particularly bad rating, and from there we get the full context of the order the call was related to. this is possible because all records are assigned cluster ids, no matter what table. and one can seamlessly switch context here: not interested in the order but instead in the customer? get the cluster id for the customer and go from there.
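the lookup flow described above can be sketched in a few lines. all table names, keys and cluster ids here are invented for illustration:

```python
# cluster id column materialized next to each table's primary key
cluster_of = {
    ("orders",        "o-1001"): "cl-42",
    ("payments",      "p-2001"): "cl-42",
    ("support_calls", "sc-9"):   "cl-42",
    ("orders",        "o-2002"): "cl-77",
}

def full_context(table, pkey):
    """start from ANY record: one lookup, one filter, no joins."""
    cid = cluster_of[(table, pkey)]                        # lookup
    return [k for k, v in cluster_of.items() if v == cid]  # filter

# start from a badly rated support call, not from the orders table,
# and still recover every record for the order (cluster cl-42)
print(full_context("support_calls", "sc-9"))
```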
there is one last point i want to touch on here. there is a good reason we call the creation of knowledge graph cluster ids “flattening”: it turns something rich and complex into a simple identifier to filter on. so yes, in this process information is lost. but the simplicity of the cluster ids is also the reason this approach is so successful: simple to use and simple to understand. it solves 80% of use cases, and for the remaining 20% we can run algorithms directly on the graph, which is more complex and powerful but also a lot more involved and requires more specialized skills.
Mirror Link – https://technotes.tresata.ai/missing-link/