A good analytics function in a business requires a combination of deep domain expertise, technical acumen, and data experience. Unfortunately, most of us lack at least one of those.
This article is for leaders responsible for analytics in your organization, especially if you lack deep experience in technology or data. Inevitably you will have to make product and technology decisions. I want you to have a good mental model for modern analytics and a solid vocabulary to participate in conversations with vendors, consultants, data teams, and executives.
The most prevalent method of building an analytics platform today is known as the “Modern Data Stack.”
Analytics platform: a data solution that provides tools and technology from the beginning to the end of the life of the data. It includes data retrieval, storage, analytics, management, and visualization.
- What is the Modern Data Stack?
- How to Visualize the Modern Data Stack
- Components of the Modern Data Stack
- Important Concepts
What is the Modern Data Stack?
Modern Data Stack, the Marketing Catchphrase
If you are just starting to explore how to put together a good analytics platform for your business, it is extremely easy to get caught up in all the marketing content and get very confused along the way.
This is an old story: innovation happens, and the good information gets lost in the noise of a bunch of marketing content.
There is a lot of innovation and a lot of new products. As you start to select products or tools, you will find that those companies have a network of partners and will direct you toward them. That is not necessarily a bad thing, but ask why they partner with certain companies and why they passed on others.
The answers will surface the things that are important to you and help you build a list of alternative products without getting sucked into the hype.
Modern Data Stack, the Technical Blueprint
A technical architecture is a blueprint for how a system can be built or put together. I’ll talk below about the particulars of this blueprint. At a high level, the Modern Data Stack is:
- Best of breed (vs. a single product that does it all)
- Centered on a data warehouse
- Follows an ELT pattern
Cloud-native: a flexible way to store and structure data in a scalable way (allows for growth) and can be accessed quickly from anywhere. More than just an architecture for data, it’s an entire fully automated ecosystem for interacting with data.
Because the Modern Data Stack is a pseudo-standard in the industry, implementers and product vendors all understand this general blueprint. What they create generally fits with the standard blueprint, making things pluggable.
But because this is not a true industry standard, levels of integration and pluggability vary. Sometimes painfully so.
If you have been an owner or user of data warehouses in the past, be ready: operations are generally looser, the pace of change is (or can be) much faster, and containing cloud costs is super important.
But the benefits of this blueprint are proving fruitful and cost effective, enabling tiny companies to have a solid analytics platform and big companies to scale to the level of their needs.
Modern Data Stack, the Movement
When hiring data people, one thing I look for is understanding and passion about the Modern Data Stack. I remember when the light bulb went off for me. It was the first time I saw software engineering best practices applied to data–a novel concept even today. It was eye-opening, and the possibilities were exhilarating.
From there, it was a journey through dozens of Slack communities (each with thousands of members), reading opinion articles on how the Modern Data Stack should work, new conferences dedicated to pieces of the Modern Data Stack, and trying to keep up with the hundreds of companies receiving VC investment and promising to be the silver bullet to all the shortcomings of all data everywhere.
Point being, there are a lot of newbies and a lot of excitement. But this is still a relatively new concept, so inexperience and mistakes are common. There is also a shortage of reliable talent with sufficient experience, and it’s getting worse. EY notes a 50% increase in 2022 in executives focusing on data initiatives and increasing hiring.
Maybe the shortage will resolve itself as the tens of thousands of junior folks upskill and gain experience. But for the foreseeable future, demand is outpacing supply.
How to Visualize the Modern Data Stack
I like to visualize the Modern Data Stack as data moving in a left-to-right flow. The phrase “data pipeline” is often used, conjuring up images of oil pipelines and refineries. It’s a good analogy of the refinement process that data goes through as it progresses through these components.
I’ll go into details on each of these components in a moment. For now, here is the overview:
On the far left are your data sources–anything that produces data that you need to analyze.
Next, we have the data lake, which is a landing zone for collecting raw data from the data sources.
From the lake, data is transformed in the data warehouse (although that’s not quite clear in the image above).
Various processes (orchestration, observability, quality, support) plug into the warehouse in a general concept known collectively as DataOps (short for data operations).
Once the data in the warehouse is transformed into an analytical data model, it is consumed by various audiences through data products appropriate for their needs. The most common data products companies start with are dashboards or direct connections from spreadsheets.
Components of the Modern Data Stack
Let’s go into more detail about these components. Something to keep in mind as you read through these–you don’t have to build these yourself. In fact, that is a poor practice with low ROI. For example, an intuitive approach is to build your own data connectors–“how hard could it be?”
The best ROI is through combining specialized products and tools into a cohesive analytics platform.
Data sources refer to anything that houses data you want to include in your analyses or metrics. They include things like:
- SaaS applications like a CRM
- ERP systems
- In-house databases created by IT
- Public data sources like census or weather data
- Data brokers or providers who sell data
- Spreadsheets that are manually maintained but critical to your operations
Connectors are software tools that know how to do two things: extract data from sources and load data into destinations (typically the lake or warehouse).
This is a deceptively complicated situation. First off, each connector is unique because each data source is unique. Imagine trying to create and maintain data connectors for all the thousands of SaaS products that are out there.
Next, imagine trying to ensure you can support all the possible destinations.
It could be a nightmare. It is the single biggest challenge for vendors that specialize in data connectors.
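To make the two jobs of a connector concrete, here is a minimal sketch in Python. Every name in it (`Source`, `Destination`, `InMemoryCrm`, `ListDestination`, `run_connector`) is hypothetical and invented for illustration; none of it corresponds to a real vendor's API.

```python
from typing import Iterable, Protocol


class Source(Protocol):
    def extract(self) -> Iterable[dict]: ...


class Destination(Protocol):
    def load(self, records: Iterable[dict]) -> int: ...


class InMemoryCrm:
    """Stand-in for a SaaS CRM; a real source would call the vendor's API."""

    def __init__(self, contacts: list[dict]) -> None:
        self._contacts = contacts

    def extract(self) -> Iterable[dict]:
        # Yield records exactly as the source holds them: no business logic here.
        yield from self._contacts


class ListDestination:
    """Stand-in for a lake or warehouse table."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def load(self, records: Iterable[dict]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def run_connector(source: Source, destination: Destination) -> int:
    """Move records from source to destination; return how many were loaded."""
    return destination.load(source.extract())


crm = InMemoryCrm([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
lake = ListDestination()
loaded = run_connector(crm, lake)
```

Notice that every new source needs its own `extract` and every new destination needs its own `load`. Multiply that by thousands of SaaS products and you can see why maintaining connectors is a business in itself.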
The data lake concept has been maligned and, in some circles, has fallen out of favor completely. But I don’t see it as optional. It serves three main purposes:
- A landing zone for collecting data from sources. Modern warehouses can perform this function–usually. If the connector you are using is robust at loading data directly into the warehouse, go for it! However, pushing data into a data lake is simpler, so fewer things can go wrong. The data lake also gives you a place to perform basic cleansing and preparation on the raw data.
- More importantly but not as immediately obvious, the lake can be a place where data scientists can get direct access to raw data, where the original copy is guaranteed to be untouched and unmodified. Just having data from multiple sources in one place is a big benefit for data scientists or analysts who otherwise would have to do that raw collection themselves.
- As you get into your analytics platform, you will quickly realize that the warehouse is the center of gravity and the biggest cost. The market will soon be dominated by a few large players who will continue to attempt to monetize you even more. Separating the lake functions from the warehouse will reduce vendor lock-in.
The data warehouse is the center of the universe in the Modern Data Stack. At its core, it is just a database, which means warehouses behave in familiar ways:
- They support the SQL language, which is the ubiquitous way of querying data in databases.
- Tools that can connect to other databases usually can also connect to data warehouses.
- Data is presented and interacted with in the well-established paradigms of tables, columns, and views.
An important distinction is that warehouses targeting the Modern Data Stack are cloud-native. They are designed for scale and priced on consumption. They can grow and grow to an unlimited scale (okay, yes, at some point there is a limit, but you aren’t Amazon or Walmart, right?).
I’ll bring up cost here because warehouses are almost always the main cost driver of your entire analytics platform. If you are familiar with SaaS products that typically have a monthly subscription, consumption pricing is next level. You are charged by the computing resources your operations consume, metered by the hour or sometimes down to the sub-second.
The best analogy is your electricity bill. The more lights you turn on and devices you plug in, the more electricity you use, and the higher your bill will be.
So although you are in control, there is no ceiling, and you are responsible for monitoring costs more closely than you may be accustomed to doing.
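A little arithmetic shows how quickly consumption pricing adds up. This is a rough sketch with made-up numbers: the function name, the "units per hour" billing measure, and the prices are all assumptions for illustration, not any vendor's actual rates.

```python
def monthly_warehouse_cost(runs_per_day: int, seconds_per_run: float,
                           units_per_hour: float, price_per_unit: float,
                           days: int = 30) -> float:
    """Estimate a month's compute bill when usage is metered by the second."""
    compute_hours = runs_per_day * seconds_per_run * days / 3600
    return compute_hours * units_per_hour * price_per_unit


# Hypothetical numbers: a 10-minute job on a warehouse that consumes
# 2 billing units per hour at $3.00 per unit.
once_a_day = monthly_warehouse_cost(1, 600, 2.0, 3.0)
every_hour = monthly_warehouse_cost(24, 600, 2.0, 3.0)
```

With these assumed numbers, running the job once a day costs $30 a month, while running the same job every hour costs $720. Nothing about the job changed except how often someone scheduled it, which is exactly why monitoring matters.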
Orchestration refers to getting all these tools and products to play nicely together. The analogy is a literal music orchestra, with a conductor at the front keeping time and directing the flow.
Good tools and products exist for orchestration, but I have found this is an area where my teams spend significant engineering work.
Orchestration products provide the logic flow of what should happen and when, the schedules or triggers that run the programs processing data in the pipelines, and error capture for triage and troubleshooting.
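As a toy illustration of those responsibilities, the sketch below runs steps in dependency order, records failures instead of crashing, and skips anything downstream of a failure. It uses Python's standard `graphlib` module; the function `run_pipeline` and the three step names are hypothetical, and real orchestration products do far more (scheduling, retries, alerting).

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(tasks: dict[str, Callable[[], None]],
                 depends_on: dict[str, set[str]]) -> dict[str, str]:
    """Run tasks in dependency order; record failures instead of crashing."""
    status: dict[str, str] = {}
    graph = {name: depends_on.get(name, set()) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        if any(status[dep] != "ok" for dep in graph[name]):
            status[name] = "skipped"  # an upstream step did not succeed
            continue
        try:
            tasks[name]()
            status[name] = "ok"
        except Exception as exc:  # captured for triage, not re-raised
            status[name] = f"failed: {exc}"
    return status


def extract_load() -> None:
    pass  # pretend the connector ran fine


def transform() -> None:
    raise ValueError("orders table is empty")


def refresh_dashboard() -> None:
    pass


result = run_pipeline(
    {"extract_load": extract_load, "transform": transform,
     "refresh_dashboard": refresh_dashboard},
    {"transform": {"extract_load"}, "refresh_dashboard": {"transform"}},
)
```

Here the dashboard refresh is skipped because the transform failed, so your audience sees yesterday's numbers rather than a half-built model.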
Ultimately your audiences–whether they be employees, customers, or some other group–do not care about the technical underpinnings. They want access to the data. Data products are any asset you manage that allows your audiences to use data. Typical examples include:
- Data Apps
- Datasets you expose directly to consumers (e.g. a view in a database could be a data product if you let people use it directly, but not all views in the database are data products)
As you go on your analytics journey, data products will proliferate. It’s natural. But only 20-50% of the data products you produce and manage will be valuable. That’s also just inevitable. Too many organizations treat them like permanent fixtures when they should be purging or redesigning.
It is also inevitable that in a living system, these data products must be maintained, or they will gradually decline in value. The thing to remember here is to treat them like assets and actively manage them.
Any journey in analytics is a crawl-walk-run journey. Be wary of anyone who tries to sell you otherwise. Implementing the core components from above will get you huge value, even if some of the benefits of the advanced components are not in place.
Artificial Intelligence or Machine Learning
Often abbreviated AI/ML, this simply means applying statistical methods and models to the data in your warehouse. For example, a popular application is applying linear regression to historical data to predict future trends.
Most leaders recognize the power that predictive analytics can have on their business. So, if there is a shortage in data talent in general, there is an extreme shortage in data scientists–people who have deep experience in a combination of your specific business domain, programming, working with data, and statistics.
What is exciting is the number of vendors pursuing “low code” or “no code” solutions for AI/ML. While they will never surpass the capabilities of a true data scientist, they certainly deliver a lot of value for a much lower price tag.
Governance answers the all-important question in business data, “Who can see what?” In other words:
- what level of access does each user role have
- what are you using for version control to keep track of modifications and access
- what security certifications are required for your business data to keep your (and your clients’) data safe
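The first of those questions, role-based access, can be pictured as a simple lookup. This is only a sketch: the role names, dataset names, and the `can_read` helper are hypothetical, and in practice these rules usually live in the warehouse or a governance tool rather than in application code.

```python
# Hypothetical mapping of roles to the datasets each one may read.
ROLE_GRANTS: dict[str, set[str]] = {
    "executive": {"revenue_summary", "customer_360"},
    "sales_rep": {"customer_360"},
    "customer": set(),
}


def can_read(role: str, dataset: str) -> bool:
    """Deny by default: unknown roles and unlisted datasets are not readable."""
    return dataset in ROLE_GRANTS.get(role, set())
```

The design choice worth noticing is the default: a role that nobody has configured gets access to nothing.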
Data Observability and Data Quality
I lump these together because most products that focus here do a mix of both. Data quality is the goal, with observability being a mechanism to get there.
Data quality simply means the data can be trusted: it is timely, accurate, complete, and so on. There are no industry standards for this. But it is a nice concept to use when audiences begin to mistrust data for a variety of reasons. Framing those conversations as data quality conversations can align everyone’s thinking.
Data observability is the concept of being able to see into the data pipelines and understand what is going on. This is especially helpful when there are errors or problems, but it is also useful for things like automatically detecting statistical differences in data.
For example, if revenue spikes 100% overnight, that most likely is a data quality problem with a root cause that can be found through data observability tooling.
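One simple way such a check could work is to compare today's value against recent history. The sketch below flags a value that sits more than a few standard deviations from the historical mean. The function name, the threshold of 3, and the revenue figures are assumptions for illustration; observability products use more sophisticated methods.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], today: float,
                 threshold: float = 3.0) -> bool:
    """Flag today's value if it is far from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # history is perfectly flat
    return abs(today - mu) / sigma > threshold


# Made-up daily revenue for the previous five days.
daily_revenue = [10_200.0, 9_800.0, 10_050.0, 9_950.0, 10_100.0]

spike = is_anomalous(daily_revenue, 20_040.0)   # roughly double the average
normal = is_anomalous(daily_revenue, 10_150.0)  # an ordinary day
```

The doubled revenue is flagged and the ordinary day is not. The flag does not tell you what went wrong, only where to start looking.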
One of the most valuable things a company can do with data is create insights for their customers as the audience. A popular method to do this is embedded analytics, which means creating a visualization or dashboard in a specialized tool, and then exposing that visualization or dashboard into a SaaS application so that end users can see it.
This is not overly complicated but does require collaboration between product management, application engineering, and data teams.
Reverse ETL or Operational Analytics
These are fancy-sounding terms that mean something quite simple. Once data is collected, transformed, and made valuable in a warehouse, why not push it back out to the operational systems?
In this component, the data sources can become data destinations. Imagine gathering data about customers and orders from CRM, billing systems, customer support tickets, etc., and creating a nice 360-degree picture of each customer. Wouldn’t it be nice to push that information back into the CRM, where the sales team lives and breathes?
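In code, the idea might look like the sketch below, which reads a finished customer table from the warehouse and writes a few fields back to the CRM. An in-memory SQLite database stands in for the warehouse, and `FakeCrmClient`, `customer_360`, and the column names are all hypothetical; a real CRM would be reached through its vendor's API.

```python
import sqlite3


class FakeCrmClient:
    """Hypothetical stand-in for a CRM API client."""

    def __init__(self) -> None:
        self.fields: dict[int, dict] = {}

    def update_contact(self, contact_id: int, values: dict) -> None:
        self.fields.setdefault(contact_id, {}).update(values)


def sync_customer_360(warehouse: sqlite3.Connection,
                      crm: FakeCrmClient) -> int:
    """Push warehouse-computed fields back to the CRM; return rows synced."""
    rows = warehouse.execute(
        "SELECT customer_id, lifetime_revenue, open_tickets FROM customer_360"
    ).fetchall()
    for customer_id, revenue, tickets in rows:
        crm.update_contact(
            customer_id,
            {"lifetime_revenue": revenue, "open_tickets": tickets},
        )
    return len(rows)


# Set up a tiny stand-in warehouse with an already-transformed table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_360 "
    "(customer_id INTEGER, lifetime_revenue REAL, open_tickets INTEGER)"
)
warehouse.executemany(
    "INSERT INTO customer_360 VALUES (?, ?, ?)",
    [(1, 1200.0, 0), (2, 450.0, 3)],
)

crm = FakeCrmClient()
synced = sync_customer_360(warehouse, crm)
```

After the sync, a sales rep looking at customer 2 in the CRM would see three open support tickets without ever opening a dashboard.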
Most analytics platforms at mid-sized companies are batch oriented. Daily data updates are the most prevalent pattern, but even updates every 10 minutes are considered batch updates–they are just smaller batches.
Batch: a system relied on for processing and analyzing vast quantities of data simultaneously. Applies to companies that hold their data in store for a period of time, as opposed to streaming data.
Streaming is a different animal and is required for true real-time or near-real-time operations. In general, the tooling for streaming is quite mature–but not for analytics platforms. Robust streaming tooling has been available to applications and operations like Internet of Things for a while and, in the coming years, analytics tooling will catch up.
Here are a few operational concepts and design patterns that you should be aware of even in an executive or completely non-technical role. You almost certainly will be asked to weigh in with your opinion to give guidance to the technical folks.
ELT vs. ETL
ETL stands for “extract, transform, load” and is a pattern that has been in place for decades. It is a data pipeline pattern that means extracting data from sources, transforming data, and loading data into a destination like a warehouse.
The problem with this pattern is it lumps the business logic of the data transformations together with the technical plumbing of moving data around. This may seem trivial, but it isn’t.
By switching the order of operations–ELT or “extract, load, transform”–we can separate the plumbing of extract and load from the business logic of transform.
This leads to a much, much more flexible system and you should have a strong opinion here!
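Here is a small sketch of the ELT pattern, using Python's built-in `sqlite3` module as a stand-in for a cloud warehouse. The table, view, and column names and the sample orders are assumptions for illustration.

```python
import sqlite3

# As extracted from a hypothetical billing system: amounts arrive as text, in cents.
raw_orders = [
    ("A-1", "2024-01-03", "1999"),
    ("A-2", "2024-01-03", "500"),
    ("A-3", "2024-01-04", "2500"),
]

warehouse = sqlite3.connect(":memory:")

# Extract + Load: plumbing only. Values land untouched.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id TEXT, ordered_on TEXT, amount_cents TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: the business logic lives in SQL inside the warehouse and can
# change without touching the load step above.
warehouse.execute("""
    CREATE VIEW daily_revenue AS
    SELECT ordered_on,
           SUM(CAST(amount_cents AS INTEGER)) / 100.0 AS revenue
    FROM raw_orders
    GROUP BY ordered_on
""")

daily = warehouse.execute(
    "SELECT ordered_on, revenue FROM daily_revenue ORDER BY ordered_on"
).fetchall()
```

If the business later decides revenue should exclude refunds, only the view changes. The raw data is still there, and nobody has to re-extract anything.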
Combining Data Across Sources
If all you need to do is analyze data from a single data source, the Modern Data Stack is not the right solution for your business.
The true power comes from bringing data together from multiple sources and doing analyses only made possible by being able to look across an entire business.
I point this out because a common mistake is to expose raw data as a Data Product. If people get comfortable with the raw data, they will push back when you try to move them to use combined, transformed data.
Worse, if your CRM tracks a real-world human by calling them a “contact,” but your billing system calls a human a “person,” then we have a terminology problem when it comes to analytics. Data efforts are already complicated enough, and terminology can be death by a thousand cuts.
Analytical Data Modeling
An analytical data model is an opinionated view of your business, made up of transformed data from multiple sources and designed for analysis at a large scale. I’ll break that down:
An opinionated view means you’ve reconciled terminology discrepancies and found (or forced) alignment on metrics and their definitions.
Data from multiple sources is not only brought together but converted into this opinionated view and combined. I.e. the “contact” and “person” records from above don’t live in different places in the warehouse but are merged together.
Designed for analysis means the structure of the tables and columns is designed to handle queries on large data sets, such as a metric that calculates average order size across your entire order history for the entire company.
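To illustrate the merging step, the sketch below folds CRM “contact” records and billing “person” records into a single “customer” record. Matching on a lowercased email address is an assumption made to keep the example short; real identity matching is usually messier. All field names and sample records are hypothetical.

```python
crm_contacts = [
    {"contact_id": 17, "email": "Ada@Example.com", "owner": "Sam"},
    {"contact_id": 18, "email": "grace@example.com", "owner": "Lee"},
]
billing_persons = [
    {"person_id": "P-9", "email": "ada@example.com", "total_billed": 1200.0},
]


def build_customers(contacts: list[dict],
                    persons: list[dict]) -> dict[str, dict]:
    """Merge both sources into one 'customer' per normalized email address."""
    customers: dict[str, dict] = {}
    for contact in contacts:
        key = contact["email"].strip().lower()
        customers.setdefault(key, {"email": key})["account_owner"] = contact["owner"]
    for person in persons:
        key = person["email"].strip().lower()
        customers.setdefault(key, {"email": key})["total_billed"] = person["total_billed"]
    return customers


customers = build_customers(crm_contacts, billing_persons)
```

The result has two customers, and Ada's record carries both her account owner from the CRM and her billed total from the billing system. Neither “contact” nor “person” appears in the output, which is the opinionated part.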
Scaling means growing or shrinking your system to support the workload at hand. Any cloud-based product vendor will tell you scaling is a solved problem. It isn’t, but the mechanics of scaling are way better than in the past.
If you are familiar with cloud platforms, you will be familiar with scaling. If not, here is a primer: before cloud, you had to run your systems on physical servers that you purchased. Buy a server that is too small, and your system (e.g., your website) might crash from too many users and too much load. Buy a huge server guaranteed to support your users, and you have overspent on capacity that you may never need.
In the cloud, scaling is configurable. Sometimes you can manually decide how powerful to make a component; sometimes, it is automatic. In all cases, you pay for it, so tuning is required.
You will be responsible for an analytics platform that is made up of many components, each with its own scaling approaches and limitations. Inevitably you will need an opinion on the tradeoffs between cost and high scale.
Emergent Systems and Iterative Implementation
Let me cut to the chase–you will not get all this right the first time. Historically, business intelligence or analytics efforts were huge projects that did not release anything out to the intended audiences for 12-18 months or longer.
We do not live in that world because the Modern Data Stack and its associated tooling embrace fast change and iterative development. The result is business value earlier and ultimately more business value because the system is aligned with the needs and opportunities.
However, the gravity of doing a big project is still very real. It manifests itself in many ways, in phrases ranging from “Go ahead and connect to data source X, we might need it” to “We can’t start building dashboards until the data model is done.”
Is the Modern Data Stack the only way to go? Certainly not, but it has by far the most traction compared to other approaches. The speed of development, shortened time to value, best-of-breed selection, flexibility, scalability, and usability of the Modern Data Stack result in a winning combination.
If you want a partner to help you navigate and manage your analytics journey, from crawl all the way to run, let’s talk about Datateer’s Managed Analytics Platform. Or if you want to learn more about this type of service, check out the Guide to Managed Analytics.
In future articles, I’ll discuss antipatterns to avoid, the Analytics Operating System to help you put the right processes on top of the Modern Data Stack architecture, and dig into benefits and drawbacks.