Knit picks: Apr 9
Hey Knit noodlers, hope you're hungry for a hot bowl of SQL! (mmm)
SQL Stuff
So far, Knit has mostly dealt with data as files. For example, DVC and Pachyderm, the two other "Git for data" orchestration products, only support working with data as files. But if we've learned anything from dbt, people want SQL databases!
Now Knit knows how to do both! Like dbt, Knit can manage dependencies between SQL queries that build up a set of tables. For now, postgres only.
File-based orchestration (DVC, Pachyderm) has focused on ML models, while SQL-based orchestration (dbt) has focused on data modeling. Knit hopes to satisfy both use cases, and also fill the gap where they intersect, like feature engineering (the step before building ML models) and data integration (pulling CSV and JSON from APIs and stuffing them into databases; what Fivetran does).
But wait, there's more! The postgres database server is also managed as a dependency. It's another example of a runtime environment as data. This time a postgres Docker container is a dependency of the SQL steps.
The net result is a single data flow command will set up the postgres server and create all the tables. If any of the tables are modified, or the Docker container is deleted, or a CSV or SQL query changes, only the minimal set of steps needed will be re-run.
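To make "only the minimal set of steps" concrete, here's a toy Python sketch of hash-keyed caching over a dependency graph. This is not Knit's actual implementation; the `Pipeline` class and its names are invented for illustration. The idea is that each step's cache key is derived from its own code version plus the cache keys of its dependencies, so a change anywhere upstream invalidates exactly the steps downstream of it and nothing else.

```python
import hashlib

def fingerprint(*parts: str) -> str:
    """Hash a sequence of strings into a single cache key."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode())
    return h.hexdigest()

class Pipeline:
    """Toy dependency-aware runner (illustrative, not Knit's API)."""

    def __init__(self):
        self.steps = {}   # name -> (deps, fn, code_version)
        self.cache = {}   # cache key -> result
        self.runs = []    # steps actually executed, for demonstration

    def step(self, name, deps, fn, code_version="v1"):
        self.steps[name] = (deps, fn, code_version)

    def run(self, name):
        deps, fn, code_version = self.steps[name]
        dep_results, dep_keys = [], []
        for d in deps:
            result, key = self.run(d)
            dep_results.append(result)
            dep_keys.append(key)
        # Key depends on this step's code and everything upstream.
        key = fingerprint(name, code_version, *dep_keys)
        if key not in self.cache:
            self.cache[key] = fn(*dep_results)
            self.runs.append(name)
        return self.cache[key], key

pipe = Pipeline()
pipe.step("raw", [], lambda: [3, 1, 2])
pipe.step("sorted", ["raw"], lambda xs: sorted(xs))
pipe.step("top", ["sorted"], lambda xs: xs[-1])

pipe.run("top")          # first run executes all three steps
first = list(pipe.runs)

pipe.runs.clear()
pipe.run("top")          # everything cached: nothing re-runs
second = list(pipe.runs)

pipe.runs.clear()
pipe.step("raw", [], lambda: [3, 1, 2, 9], code_version="v2")
pipe.run("top")          # upstream changed: downstream steps re-run
third = list(pipe.runs)
```

Deleting a cached artifact or editing one query is the same story: only the keys that depend on the change go stale, so only those steps execute again.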
What's going on? This is going to be a little roundabout...
Data structure stuff
What's your favorite data structure? Linked list? Nah, too slow. Multidimensional array? So square, literally. Binary search tree? Uhhh have you seen the code for red-black?
Merkle tree? Now we're talking!
The core idea is content addressing, which just means that data is identified by its unique hash[1]. When graphs are represented by content-addressable storage, we sometimes call them Merkle trees (technically, they are DAGs). This is the fundamental data structure of Git and Bitcoin. And Knit!
Merkle trees are fundamentally immutable; objects cannot be modified because that would change their unique hash. To "change" the data structure, you actually have to create new objects. This is fantastic for Git, for example, because it makes it really easy to keep history, and it's fast and space-efficient.
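Here's a tiny content-addressed object store in the spirit of Git's object model (an illustrative sketch, not Knit's or Git's actual on-disk format): every object is identified by the hash of its contents, and tree nodes refer to their children by hash, so "modifying" a file really means minting new objects all the way up to the root.

```python
import hashlib
import json

store = {}  # hash -> serialized object

def put(obj) -> str:
    """Store an object under the hash of its contents, returning the hash."""
    data = json.dumps(obj, sort_keys=True)
    oid = hashlib.sha256(data.encode()).hexdigest()
    store[oid] = data          # storing identical content is a no-op
    return oid

# A tiny "tree": files are blobs, directories map names to child hashes.
readme_v1 = put("hello")
tree_v1 = put({"README": readme_v1})

# "Changing" the file can't touch existing objects; it creates new ones,
# and every ancestor up to the root gets a new hash too.
readme_v2 = put("hello, world")
tree_v2 = put({"README": readme_v2})

# The old tree is still intact, so keeping history is free.
old = json.loads(store[tree_v1])
```

Note the space efficiency: unchanged subtrees keep their hashes, so old and new versions share all the objects that didn't change.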
BUT have you ever tried to delete a password from Git? It's HARD.
So what is Knit supposed to do with data? I mean, sometimes you need to delete data (PHI, GDPR). And immutable data sounds like a lot of storage overhead, keeping all those versions.
This is part of the reason why Knit itself only stores metadata, like where the actual data is stored. All of the metadata is immutable, giving us most of the versioning goodness while the data itself need not be immutable.
So Knit works fine with any mutable data? Well, not quite. If data changes behind Knit's back, then it can return stale cached results and generally cause all sorts of confusion. Somewhere in between mutable and immutable data is what I call tamper-evident data. So if the data changes, Knit will know about it, and recompute what's needed. For postgres specifically, this is implemented with database triggers that keep an updated timestamp for each table.
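A toy sketch of that tamper-evidence idea (invented names, not Knit's real mechanism): every write to a "table" bumps an updated-at stamp, the way a postgres trigger maintaining a timestamp column would, and a cached query result is reused only if the stamp hasn't moved since it was computed.

```python
class Table:
    """Mutable data whose every write bumps a version stamp,
    standing in for a trigger-maintained updated-at timestamp."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.version = 0

    def write(self, rows):
        self.rows = list(rows)
        self.version += 1      # the "trigger" fires on every mutation

class CachedQuery:
    """Reuse a cached result only while the table's stamp is unchanged."""

    def __init__(self, table, fn):
        self.table, self.fn = table, fn
        self.seen_version = None
        self.result = None
        self.recomputes = 0

    def run(self):
        if self.table.version != self.seen_version:
            # Data changed behind our back: recompute, never serve stale.
            self.result = self.fn(self.table.rows)
            self.seen_version = self.table.version
            self.recomputes += 1
        return self.result

sales = Table([10, 20, 30])
total = CachedQuery(sales, sum)
a = total.run()                  # computes
b = total.run()                  # cached, no recompute
sales.write([10, 20, 30, 40])    # out-of-band mutation bumps the stamp
c = total.run()                  # stamp moved, so recompute
```

The data itself stays mutable; only the stamp has to be trustworthy, which is why a database trigger is enough to get cache correctness without immutability.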
I think if we can implement tamper evidence for most data storage systems, then Knit will be able to manage dependencies across both mutable and immutable data stores. Mutable data will save some space and should be compatible with pre-Knit data processing. Immutable data gets the full data versioning goodness.
[1] I wave my hands here, but Knit pickers might say probabilistically unique hash. Content addressing today usually uses cryptographic hash functions because collisions are infeasible. (Although Git is currently undergoing a years-long effort to move away from SHA-1 because it is no longer considered collision resistant. Thanks Google.)