For over a year now, I’ve been working with Dagster as my orchestrator of choice. I originally made the joint decision with my team to migrate workflows from our initial Airflow MVP, slowly ramped up in the new system and uncovered much of its hidden power, and then moved to a new company, where I was tasked with single-handedly building the entire end-to-end data infrastructure.
Having seen firsthand the impact that Dagster had on my team’s productivity, and knowing the platform at the new company would require end-to-end lineage, I made a decision that seemed risky at the time but has already paid off in unexpected ways: I bet on Software-Defined Assets as the core abstraction for the entire platform.
Dagster defines an asset as “an object in persistent storage, such as a table, file, or persisted machine learning model. A software-defined asset is a Dagster object that couples an asset to the function and upstream assets that are used to produce its contents.”
It doesn’t sound revolutionary at first, but read between the lines and you’ll see that Dagster’s SDAs subtly flip data orchestration on its head: rather than chaining tasks into a pipeline that eventually produces a persistent object, you declare the persistent objects themselves and let the pipelines and tasks be created in the background.
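To make that concrete, here is a minimal sketch of what declaring assets looks like (the raw_orders and cleaned_orders assets are hypothetical, invented for illustration):

```python
from dagster import asset

@asset
def raw_orders():
    # The returned value is persisted by Dagster's I/O manager,
    # so this function *is* the definition of a stored object.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 17.5}]

@asset
def cleaned_orders(raw_orders):
    # Dagster infers the dependency on raw_orders from the
    # parameter name; no explicit task wiring is needed.
    return [order for order in raw_orders if order["amount"] > 0]
```

Nothing here describes a pipeline: we declared two persistent objects and their relationship, and Dagster derives the execution graph in the background.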
You might end up writing slightly less code, but the biggest change is a philosophical one: the way you think about your work changes, and that mindset shift has made me dramatically better at my job.
As data engineers, we’re often involved in the nitty-gritty. On any given day, I might be managing VPCs, scaling clusters, upgrading packages, enhancing CI/CD processes, or writing pipelines. Data consumers, however, are many steps removed from all of this underlying complexity. Before Dagster, writing pipelines meant thinking in classic ETL/ELT terms: what needs to be extracted from where, how it should be transformed, and how it should be exposed to end users. It takes many steps to go from the raw source data to anything that might be consumed, whether my data consumer was a data scientist, an analytics engineer, or a business user (or, in a team of one, myself a day later). An error occurring in any of these steps could (and likely would) affect the final downstream product, no matter how much or how little the consumer knew about it.
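For contrast, here is roughly what that task-first world looks like, as a minimal Airflow sketch (the DAG and task names are hypothetical, and the task bodies are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables; real ones would move the data.
def extract():
    pass

def transform():
    pass

def load():
    pass

with DAG("orders_pipeline", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    # The pipeline is a chain of tasks; the tables they produce
    # exist only implicitly, invisible to the orchestrator.
    t1 >> t2 >> t3
```

The orchestrator sees three tasks and an order of execution; the data they produce is nowhere in the picture.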
A new age in data engineering
What if, instead, we stopped thinking about pipelines, and started thinking about their outputs? What if we stopped discussing task chaining and started discussing physical objects? What if we could put ourselves in our users’ shoes without any mental gymnastics, not just for the duration of a 50-minute meeting, but see our entire body of work through that lens?
What if all of the complexity of chaining tasks was outsourced to a dependency graph, rather than having engineers try to fit a square peg into a round hole?
Imagine if, instead of frantically mapping tasks to data on the fly, we could have impromptu conversations with our data consumers and still leave them feeling understood, because the conversation reflects what we see in the codebase and GUI.
Data engineering is a hard job. Through all the layoffs, there are still plenty of roles to go around (at least in Germany, where I’m based) and companies struggling to fill them with good candidates. I’ve long said that the biggest difficulty in data is translating what is an inherently cultural challenge (filling gaps between siloed teams) into extremely technical solutions. The organizational impact of the best communicators with just enough engineering expertise will tower over that of the best engineers with just enough communication skills.
But what if the tools we choose for our jobs could oh-so-subtly nudge our view of the systems we’re constructing toward how our users see them?
What if developing a use case wasn’t about writing a new pipeline, creating tasks, and scheduling them with enough of a buffer to ensure that things always run smoothly? What if developing a use case could be about creating new physical objects? And what if these objects were clearly laid out and labeled in front of us, so we can always know what depends on what, and what impact an upstream change could have?
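In Dagster terms, that is exactly what it looks like: a new use case is just a new declared asset hanging off the existing graph. Continuing the hypothetical sketch from earlier:

```python
from dagster import asset

@asset
def revenue_report(cleaned_orders):
    # A new "use case" is a new declared object; its place in the
    # lineage graph shows up in the Dagster UI automatically.
    return sum(order["amount"] for order in cleaned_orders)
```

One function, one new node in the graph, and the downstream impact of any change to cleaned_orders is visible at a glance.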
This might not help us write better code, but it surely would help us communicate better, and at the end of the day, that matters a whole lot more.