Most business leaders recognize the latent value of data. Yet many people think it takes 12-18 months to implement a cloud data warehouse. Why is that perspective so common?
“Warehouse” sounds big and intimidating. “Cloud” sounds new and fraught with risk as well as possibility. And there is no shortage of consultants and product vendors promising that their offering is the silver bullet to make all this easy.
Data can be complex. But the last couple of years have seen many innovations to accelerate and simplify this process, with no sign of slowing down.
The accelerated adoption of SaaS tools and products has scattered business data across many different silos. The SaaS industry is growing 17% year over year, and that growth is accelerating. The average mid-sized company uses 30% more SaaS apps than it did last year, averaging 137 SaaS products per company. That is 137 data silos. When data is spread out like that, it is easy to see why people think bringing it all together for analysis is next to impossible.
Most companies have proprietary data sources, such as application databases, that are also part of the mix.
With this much complexity, how do you even begin to leverage data to answer questions about business operations and the customer journey?
Data is 15 years behind
A few years ago when I dove into data analytics, I came from a software background. The innovation in software tools and processes was truly impressive at the time (and still is). Stepping into the data world, I felt like I had stepped back in time. Everything was done as if these innovations did not exist. Source control, automated tests, continuous integration: all were foreign concepts.
Because tooling and processes for working with data were so archaic, the results were abysmal. Adding new data sources took weeks or months. Troubleshooting issues took deep expertise not just in data but in the specific implementation itself. Because changes took so long, the cost of mistakes was high, which in turn lengthened both the analysis and the time to deliver anything.
People and companies were stuck spinning their wheels, just waiting on results.
It is no wonder many people still feel like analytics efforts are beyond their grasp–historically, these efforts have been expensive, slow, and risky.
Innovations and the Modern Data Stack
Today, analytics tooling and processes are rapidly catching up, faster than I ever thought possible. The challengers are open source companies and companies built on cloud delivery models. A16z did a great job capturing this movement.
Generally analytics platforms are visualized left to right, in an ETL pattern (extract, transform, load).
A key shift in pattern is the move from ETL (extract-transform-load) to ELT (extract-load-transform). Although the change seems trivial, moving the data transformations to the end rather than the middle makes it possible to apply better architectural principles, such as breaking down a complex system into smaller components.
For example, companies like XPlenty, Fivetran, and Stitch (and open source tools like Meltano, Airbyte, and Singer) can focus on extracting the data and loading it to a data lake–without attempting to modify the data. Removing the transformation responsibility greatly simplifies the problem, allowing them to focus.
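The ELT split described above can be sketched in a few lines. This is a minimal illustration, not how any of the named tools work internally: it uses Python's built-in sqlite3 as a stand-in for a cloud warehouse, and the table names, sample rows, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as an extract tool might deliver them.
# Amounts arrive as text because the loader does not modify source data.
RAW_ORDERS = [
    ("o-1", "acme", "120.50", "2021-03-01"),
    ("o-2", "acme", "79.50", "2021-03-02"),
    ("o-3", "globex", "200.00", "2021-03-02"),
]


def extract_and_load(conn, rows):
    """E and L: copy source rows into a raw table without changing them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, ordered_on TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """T: build an analytics-ready model from the raw table, inside the database."""
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        "CREATE TABLE customer_revenue AS "
        "SELECT customer, "
        "       SUM(CAST(amount AS REAL)) AS total_revenue, "
        "       COUNT(*) AS order_count "
        "FROM raw_orders GROUP BY customer"
    )


def run_pipeline():
    """Run load first, then transform, and return the modeled rows."""
    conn = sqlite3.connect(":memory:")
    extract_and_load(conn, RAW_ORDERS)
    transform(conn)
    return conn.execute(
        "SELECT customer, total_revenue, order_count "
        "FROM customer_revenue ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Because the loader never touches the data, it can be swapped out or rerun without affecting the transformation step, and the transformation can be rewritten without reloading anything.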
At the center of the movement is dbt (data build tool, from dbt Labs), which deals with transforming raw data into analytics-ready data models capable of answering questions. Simple innovations like introducing the software engineering practices of source control and templating are driving fast adoption of dbt.
Behind all of this is the cloud, particularly cheap storage and specialized analytical databases, such as Snowflake, Redshift, BigQuery, and Rockset.
With those major components in place, other specialized tools are appearing for things such as workflow and scheduling (Airflow, Prefect, Dagster), data quality analysis (Great Expectations, Soda SQL), and data operations monitoring (Monte Carlo).
While none of these are quite plug and play, they follow the modern data stack architecture and integrate much better with each other than the tools from the past.
Tools are one thing, but improved processes are just as important. Applying agile principles to data efforts can turn what used to be a slow, painful process into one that delivers insights early and often.
In my experience, most people’s mental models are slower to change than the innovations in tools and products. Most organizations still approach analytics efforts as a monolithic project. Consulting companies and IT teams continue to love big projects (which are prone to fail, by the way) because a big budget creates perceived stability and stature.
But analytics is a journey and follows a maturation process in any organization. At Datateer, we use a framework we call Progressive Analytics. It simply applies principles of agile (iterative delivery, reduce WIP) and lean (design for flow and pull, reduce waste) to deliver value early and often. This value is defined as using data to answer questions and assist decision making.
The first and most important principle is to think right to left, instead of left to right. Consider a simple visualization of an analytics system, where data flows in from various sources, is transformed into a data warehouse model, and is then used to answer questions through dashboards or other analyses.
Most people implementing analytics think first about the data sources and how to move the data into the warehouse, then how to transform and analyze. This is left-to-right thinking, or “push” based design in lean terms. This thinking makes teams feel like they have to finish all the raw data loaders before they can move to cleaning or even think about combining data sources and transforming them, much less building dashboards.
Instead, think in terms of the questions that need answering, the audiences asking those questions, and the subject areas they pertain to. Then move to the left to design the data model only enough to answer those questions. By focusing on only a handful of questions, the amount of data modeling required is much smaller.
Then move left again to only pull in the data you need to supply the data model, to answer the questions.
By doing this, you can start to use the analytics in short order. Because you build only the portions of the data model that are required, and load only the source fields that are required, velocity is much higher. From there you can iterate to expand the number of analyses and dashboards, and the number of data sources involved.
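To make the right-to-left idea concrete, here is a small sketch that starts from a prioritized question and works backward to the model columns and source fields it requires. Every question, model column, and source field name below is hypothetical, invented only for illustration; a real mapping would come out of conversations with the people asking the questions.

```python
# Step 1 (rightmost): questions, each mapped to the model columns it needs.
QUESTION_NEEDS = {
    "monthly revenue by customer": {
        "revenue.customer", "revenue.month", "revenue.amount",
    },
    "open support tickets by customer": {
        "tickets.customer", "tickets.status",
    },
    "marketing spend by channel": {
        "spend.channel", "spend.amount",
    },
}

# Step 2 (middle): model columns, each mapped to the source fields that feed it.
MODEL_SOURCES = {
    "revenue.customer": {"billing.invoices.customer_id"},
    "revenue.month": {"billing.invoices.issued_at"},
    "revenue.amount": {"billing.invoices.total"},
    "tickets.customer": {"helpdesk.tickets.requester_id"},
    "tickets.status": {"helpdesk.tickets.status"},
    "spend.channel": {"ads.campaigns.channel"},
    "spend.amount": {"ads.campaigns.cost"},
}


def plan_right_to_left(priority_questions):
    """Return only the model columns and source fields the chosen questions need."""
    model_columns = set()
    for question in priority_questions:
        model_columns |= QUESTION_NEEDS[question]

    source_fields = set()
    for column in model_columns:
        source_fields |= MODEL_SOURCES[column]

    return sorted(model_columns), sorted(source_fields)


if __name__ == "__main__":
    columns, fields = plan_right_to_left(["monthly revenue by customer"])
    print("Model columns to build:", columns)
    print("Source fields to load:", fields)
```

Choosing a single question here pulls in three fields from one source and leaves the other two sources untouched until a later iteration calls for them.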
A business may need answers to hundreds of questions from its analytics platform. Taking on all these questions at once is a recipe for long projects and high risk. By starting with the questions that need answering, prioritizing and iterating become possibilities.
Delivering the answers to a few critical questions in a few days is far superior to trying to answer all the questions a year from now. Once those first questions are answered, the flywheel has started, and moving on to the next-highest-priority questions can follow the iterative, progressive patterns now established.
In our experience, following the Progressive Analytics framework produces results in weeks. Thanks to the innovations in tooling and products, creating iterative processes to deliver analytics is now a possibility for any organization.