Why is it so difficult to keep it simple in Data Engineering?

Image sourced from: gettyimages

If data engineering had a slogan, it would be something like: “Simple in theory, complicated in practice.”

Over the years, I’ve noticed that the real complexity in data engineering often has less to do with data itself and more to do with how we think about building systems. Somewhere along the way, simplicity became unfashionable and complexity started sounding smarter. You can see this in interviews, conferences, data engineering newsletters, blogs, and so on. I thought I’d write my take on this in a casual manner, and believe me, I am writing it with a smile. You should take it as a light read too (for a change, let’s forget about what AI is going to do to us in data engineering), and if you are an experienced data professional, hopefully this piece will bring a smile to your face as well.

Here are a few patterns I keep seeing:

  1. We love jargon—and refuse to admit it: Every field has its vocabulary, but data engineering has turned jargon into a personality trait. We often describe fairly straightforward ideas using language so dense that even experienced engineers need a second read. And once something has a complex name, questioning it feels almost rude. Ironically, the moment a system needs heavy jargon to be explained, it’s usually a sign that it isn’t as simple as we think.
  2. Mention data modelling, get called a dinosaur: Talk about data modelling, and suddenly you’re seen as someone reminiscing about “the good old days.” Yet, no amount of distributed compute can save a poorly modelled dataset. Modern architectures may look different, but data still has relationships, constraints, and meaning—whether we acknowledge them or not. Ignoring modelling doesn’t make systems modern. It just makes problems show up later, usually during a critical release.                            
  3. Yes to new tools, no to understanding their behavior: New tools are exciting. They promise speed, scalability, and fewer headaches. But there’s a catch: we often adopt them without fully understanding how they behave under load, how they fail, or what assumptions they make. When something breaks, we blame the platform, the data, or sometimes even “unexpected edge cases.” Most of the time, the tool did exactly what it was designed to do—we just didn’t read past the first page of the documentation.
  4. The utopia of dev environments: Everything works perfectly in development. Pipelines run faster, data volumes are manageable, and edge cases politely stay hidden. Then production arrives—with real data, real users, and real consequences. The gap between dev and prod isn’t just technical; it’s emotional. Dev gives confidence. Prod gives humility.
  5. The Taj Mahal syndrome: We don’t just want systems that work—we want systems that impress. Over-engineering often comes from good intentions: future-proofing, scalability, elegance. But not every pipeline needs to be a monument. Some just need to move data from point A to point B, reliably and understandably. Maintenance teams usually appreciate clarity more than architectural beauty.
  6. Expensive tools will fix our governance problems (right?): Data governance is critical, no doubt about it. But tooling alone doesn’t create governance. Without clear ownership, definitions, and accountability, even the most expensive solutions become fancy dashboards that nobody fully trusts. Governance starts with discipline and process; tools should support that—not replace it. There’s no shortcut here, unfortunately. If there were, someone would have already packaged it as a product.
  7. We forget that communication is still the key: At the heart of most data issues lies a simple problem: miscommunication. Between teams, between business and tech, or even within the same group—assumptions go undocumented, definitions drift, and context gets lost. No amount of automation can fix a misunderstanding that was never addressed. Clear conversations often prevent problems that no pipeline can later solve.

Simplicity in data engineering isn’t difficult because the problems are impossible. It’s difficult because simplicity requires restraint. It asks us to pause before adding another tool, another layer, another abstraction. It asks us to understand before we optimize, to model before we distribute, and to communicate before we automate.

There’s nothing wrong with modern architectures, shiny tools, or ambitious systems. They all have their place. But complexity should be a consequence of necessity—not a badge of intelligence.

In the end, the most reliable systems I’ve seen weren’t the loudest ones. They were understandable. Predictable. Boring, even. And in production, “boring” is often the highest compliment. (If you have fixed broken pipelines on weekends, you know you crave “boring”.) The most resilient systems are the ones that can be understood by more than just their creators. Similarly, the messiest projects I have been part of weren’t the most complex technically; they were the ones with unclear ownership, missing legacy information, or an incomplete understanding of the use case or source system.

Keeping it simple isn’t about doing less. It’s about thinking more clearly. And that, perhaps, is real engineering. So, the next time you design a pipeline, adopt a new tool, or review an architecture, pause for a moment and ask: Is this as simple as it can reasonably be? Not simplistic. Just clear.

And trust me, after a few years of running your pipelines in production, you will thank yourself.

Disclaimer: The views presented here are my own and are neither related to nor influenced by the company I work for.
