I guarantee you have heard this phrase before. You may have uttered it yourself. This phrase kills analytics engineering efforts. In the moment, you do not realize the difficulties you have just created for yourself. And when they hit, the moment is long gone, so identifying the root cause of your woes is nearly impossible.
Imagine the team during a planning session, trying to decide which data fields to bring in. The fields are not necessary to the questions needing answers, but they are related. After some debate, the session starts to go long. Everyone feels that they have spent enough time discussing something seemingly minor in the grand scheme of things.
“Go ahead and bring them in. We might need them.”
Everyone agrees, and it seems wise to move on to more important matters. Your team is already going to be pulling some data out of whatever system is providing it. It would not take much work to pull in these extra fields, and certainly enough time has been wasted just discussing whether you actually need them.
This may seem simple and innocent, and it happens on most projects. You get into the technical details and forget about the big picture. You think you understand the blast radius of seemingly innocent moves, but complexity grows exponentially.
Have you ever heard the term “data swamp”? Data goes into the data lake and never leaves, never moves. It gets stinky and scary, and no one wants to venture inside any longer.
“We might need it” is the most dangerous phrase in data analytics.
Whoever said “Mo Data, Mo Problems” must have been in that meeting when someone said, “we might need it.”
It overloads your people and systems
Users and analysts are confronted with far more information than they need. The mental load to understand data sets, data fields, relationships among tables, lineage, and intended purpose is real and large.
We promise easy-to-use, self-service analytics. But how much time does a department head with a day job have to dedicate to reading definitions of hundreds of data fields to find the ones she wants?
It increases mistakes
Consider the poor department head above. She knows her business and domain, but translating that knowledge into which data fields to use, and how to use them, takes time and energy. The potential for misuse, or even getting the analysis flat out wrong, increases with the number of inputs she has to use and understand.
It increases costs and time
Simply put, there are costs associated with cleaning, processing, and checking the quality of each and every data field. We underestimate these costs. More on this below.
It makes your system unnecessarily bigger and more complex
This gets into maintainability and performance. More moving parts means more things can break. This may be an oversimplification, but each data field carries a weight. Each drop of water in a bucket adds to the weight.
If “we might need it” only ever resulted in bringing in a single data field, maybe this wouldn’t be a problem. In reality, it is a slippery slope that is usually left unchecked.
Why does this happen?
The modern data stack enables it
I love the modern data stack. Everyone is writing about it. However, better tools do not always produce better results. Understanding how to use these tools to actually produce useful information is the key. Handing me a free commercial driver's license does not mean I am safe behind the wheel of a big rig.
Extraction has become easy, and is even starting to be commoditized. Cloud data warehouses are more capable than ever, as is an assortment of related tools. It has reached the point where pulling over extra data fields is a matter of checking a few more boxes.
“We might need it” is more prevalent than ever.
We go about it backwards
Every data flow diagram you have ever seen shows data flowing left to right. That influences how we work, and many, many projects work left to right.
How many times have you heard that the data scientists are just going to look for patterns in the data? Or that the analyst wants to bring in everything and sort it out later?
By beginning with the end in mind, we can bring in only the data required, and iterate later if more fields are needed. This is one of the cornerstones of Datateer’s Progressive Analytics framework.
We don’t understand the requirements
Related to beginning with the end in mind, “We might need it” is a symptom of a lack of understanding of the requirements. Taking the time to understand the target audiences and their questions makes the question of which data fields to bring in much simpler. Needing a particular field or not can be determined by knowing if it is required to answer a particular question or set of questions.
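One way to make that test concrete is to write down each question together with the fields it needs, then check every candidate field against that list. The following is a minimal sketch, not a prescribed method: the questions, field names, and the `required_fields` and `triage` helpers are all hypothetical and exist only for illustration.

```python
# Hypothetical questions mapped to the fields each one requires.
QUESTIONS = {
    "What is the bounce rate per landing page?": {"session_id", "landing_page", "page_views"},
    "Which campaigns drive the most sessions?": {"session_id", "campaign"},
}


def required_fields(questions):
    """Return the union of fields needed by at least one question."""
    return set().union(*questions.values())


def triage(candidate_fields, questions):
    """Label each candidate field as 'bring in' or 'defer'."""
    needed = required_fields(questions)
    return {
        field: "bring in" if field in needed else "defer"
        for field in sorted(candidate_fields)
    }


if __name__ == "__main__":
    # Hypothetical fields offered by the source system.
    candidates = {"session_id", "landing_page", "page_views", "campaign",
                  "browser_version", "screen_resolution"}
    for field, decision in triage(candidates, QUESTIONS).items():
        print(f"{field}: {decision}")
```

A field that no question requires is deferred rather than rejected forever. When a new question shows up, it gets added to the mapping and the field comes in with it.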
Sometimes our data teams are structured such that engineers do not have exposure to requirements or lack details. Other times (maybe more often), we are lazy and do not do enough due diligence to get to specific questions that need to be answered. Broad or vague requirements like “the marketing department wants to answer questions about bounce rates” are the result of not digging deep enough.
We underestimate the work it takes to manage data
We ignore or underestimate the actual costs, in both time and computing, of things like data cleaning, data quality checks, testing, and even the mental strain of parsing and selecting the fields we need for any given use case. When you need 10 fields and bring in 100, you have roughly 10x the effort and cost on your hands.
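The arithmetic behind that 10x can be sketched as follows. This assumes every field carries the same fixed upkeep cost, which is a simplification, and the hour figures and the `upkeep_hours` helper are hypothetical placeholders to be replaced with your own estimates.

```python
# Hypothetical per-field effort in hours; replace with your own estimates.
HOURS_PER_FIELD = {
    "cleaning": 1.0,
    "quality_checks": 0.5,
    "testing": 0.5,
    "documentation": 0.25,
}


def upkeep_hours(field_count, hours_per_field=HOURS_PER_FIELD):
    """Estimate total upkeep, assuming cost scales linearly with field count."""
    return field_count * sum(hours_per_field.values())


needed = upkeep_hours(10)        # the fields the questions actually require
brought_in = upkeep_hours(100)   # the fields pulled in "just in case"
print(f"needed: {needed} h, brought in: {brought_in} h, "
      f"ratio: {brought_in / needed:.0f}x")
```

Under this linear assumption the ratio of costs simply equals the ratio of field counts, whatever the per-field numbers are.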
I have seen efforts to label those 10 fields as the “critical data elements” (or similar approaches), so that people of the future understand those 10 should have higher quality and attention. But I have yet to see where those efforts actually paid off. When someone sees the other 90, they want to be able to use them and assume a certain level of quality.
Databricks pushes the concept of bronze, silver, gold data sets, in part to help address the exponentially increasing number of data sets. But even a bronze data set has certain quality rules and must itself be managed. That management always takes time and attention.
Put some science on it
A mental model can help generalize what is going on here. System or organizational complexity is an area of research attempting to explain and help model complex systems. Given that in our modern world we have more tools, deeper specializations, and ever more data, it has become increasingly easy to make things more complex. Mental models help us understand and hopefully reduce complexity.
For some reason we humans tend to think that taking on more is always better. W. Ross Ashby's law of requisite variety explains that to maintain order, we have to meet complexity with complexity. In other words, keeping things simple does not allow us to meet the demands of an increasingly complex world. A marketer who can measure 25 things about 10 different segments of a customer base must do so just to keep up. Otherwise their competition will do so and be able to market more effectively.
Or so goes the logic. But that can become overwhelming quickly. To be successful in using data for driving business understanding, you must focus and target what you want to know. So there are limits to how far you can take Ashby’s law.
Niklas Luhmann was another influential researcher into complex systems. He contributed the theory of reduction of complexity. In an environment with virtually infinite information available, we must be selective about which pieces of information are most meaningful and how much energy we spend analyzing them. Choosing the right information, and focusing your energy on analyzing it, yields the best results.
Combining these two competing laws results in a simple curve, with the peak representing the optimal state, or state of highest performance in a complex world.
- Ashby wants you to increase complexity to make sure you can handle the complex world around you.
- Luhmann wants you to reduce complexity to be able to take action and reduce overload.
My personal observations indicate most people and organizations are way too far to the right of that peak, on the side of too much complexity. For some reason, we think more data, more complexity, and more analyses are somehow always better. The result is that most organizations waste enormous amounts of energy and time in the minutiae without producing results.
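To see why a curve with a single peak falls out of these two pulls, here is a toy model. The functional form is an assumption chosen purely for illustration and is not taken from Ashby, Luhmann, or the reference below: the benefit of added complexity grows with diminishing returns (a square root), while the overload it causes grows linearly. The `performance` function and `OVERLOAD_PER_UNIT` constant are hypothetical.

```python
import math

# Assumed overload cost per unit of complexity (illustrative only).
OVERLOAD_PER_UNIT = 0.1


def performance(complexity):
    """Toy model: diminishing-returns benefit minus linear overload cost."""
    return math.sqrt(complexity) - OVERLOAD_PER_UNIT * complexity


# Scan a range of complexity levels and find where performance peaks.
levels = range(0, 101)
best = max(levels, key=performance)

print(f"peak at complexity {best}, performance {performance(best):.2f}")
print(f"far right (complexity 100), performance {performance(100):.2f}")
```

In this toy model, performance rises while extra complexity still helps you handle the world (Ashby's side), peaks, and then falls as overload dominates (Luhmann's side). An organization sitting at the far right of the scanned range does no better than one that collected nothing at all.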
What to do about it
Easier said than done, but the solution is simple.
First, everyone must agree on a crawl-walk-run mentality as a guiding principle. In my experience, business executives immediately buy into this idea, especially if they have past experience with overdue technology projects. They prefer good enough immediately over perfection later. Leverage them to help evangelize the crawl-walk-run guiding principle. Reference this principle when you run into hard decisions.
Second, label the behavior for what it is: “we might need it” is not prudent, it is lazy. It is a shortcut or attempt to cover up a lack of clarity around requirements or goals, a lack of understanding of the intended audiences, or a lack of understanding of the business operations.
From there, the choice is simple: either “we definitely need it,” or you defer until you do.
- De Toni A. F., Pessot E., Candussio F., De Zan G. Climbing the Complexity Hill: An Operations Management Case Study. European Conference on Complex Systems 2016