
Intro to Progressive Analytics #3: Metrics & Delivery

Progressive Analytics is a framework that strives to achieve value from analytics quickly and sustainably. It is designed on principles of agile and lean, tailored to the modern data architecture, and born from real-world experience delivering analytics platforms.

A solid framework also helps when things get off track. Without a modular, incremental delivery framework, it is difficult to pinpoint exactly where things are going wrong.

This post addresses the question: what are we building? Earlier posts focused on why we are doing this, and on where and how we are getting our data.

It is all too common and easy to jump from a whiteboard or requirements session straight into implementation. Although data architecture is often planned well, operations does not get as much attention. Worse, metrics management often gets no attention at all, nor does a plan for things like report or dashboard governance.

How many times have your audience members questioned the implementation of a metric (e.g. “I thought we were going to exclude canceled memberships in this number?”) or lost track of where to find answers (“Where was that dashboard again?”)?

In my experience, too little time is spent defining a clear plan for the delivery and management of metrics. This layer of Progressive Analytics can be iterative too. Start lightweight; as the analytics program gains traction and grows, delivery and metrics management can grow with it.

Delivery

Delivery refers to how we deliver analytics to our Audiences. This is the “last mile” in getting analytics to the people who need them. There is a lot to think about here, but again, it can be iterative. Below are some of the things we discuss and think about. In some iterations this is a five-minute discussion; in others it takes longer.

Data Products

Originally we called these delivery vehicles. A vehicle is “a medium through which something is expressed, achieved, or displayed.” I liked the analogy of the mail or delivery truck taking a package the last mile to the final destination. Although Data Product does not have a consistent definition, the industry seems to be converging on this term.

Examples of Data Products include:

  • Internal dashboards
  • Customer-facing dashboards embedded in a SaaS app or web site
  • Datasets surfaced in a data catalog as part of a data mesh
  • Independent, embedded visuals in a mobile or SaaS app
  • Direct database connections for statistical tools, or R or Python connections
  • Downloads into spreadsheets for analysts
  • Data apps in Streamlit or Plotly Dash
  • Downstream data pipelines for other teams or companies to consume

Defining Audiences is a prerequisite and input into defining Data Products. 

Prioritization

Data Products are often implied and very clear to the people driving the effort, but easily misunderstood by the broader set of stakeholders. Even simple parameters, like whether a dashboard is internal or external, can easily be assumed by someone too close to the work. Clear and explicit discussions help align all stakeholders.

If your plan calls for multiple Data Products, prioritize and draw a line to limit the amount of work in progress. Dealing with a single or limited set of Data Products at a time in any one iteration will increase delivery velocity.

Governance

Much could be said about report and dashboard governance (and much has been), and I will not attempt to cover everything here. But data governance often receives far more attention than report governance.

Defining a clear owner for each Data Product is the first step toward keeping them accurate and maintained.

Defining Audiences and Questions as an input to this usually makes security and permissions of Data Products straightforward. 

Lifecycle

Versioning dashboards, datasets, data models, and the like is not common in analytics platforms, but it could be, and it brings a lot of value when needed.

Software development and product management have strong concepts of lifecycle: design, development, test, version, deploy, rollback. Not so much when it comes to Data Products. I expect this will change over time as data product management matures.

How will you make updates to a dashboard that is live and used by dozens or hundreds of people each day? That plan might be different compared with a quarterly PDF report that gets manually emailed to the board of directors. 

Metrics

Surprisingly, Metrics management is a missing piece of modern data platforms. This is not a problem until you get past the first few iterations, when it becomes impossible to remember all the metrics’ definitions and implementations. 

Metrics are the atomic unit of analytics. The business doesn’t care to think in terms of quarks or gluons. They want to answer questions through analysis of metrics, trends on those metrics, and combinations of those metrics. 

All too often, metrics are implemented in a BI tool. This is a bad practice, because any other Data Product either has to be based on that BI tool, or you must re-implement the metric logic in multiple locations. This is a recipe for inconsistent metrics, leading to the Audiences losing trust in the data platform as a whole. This goes against all the reasons we have centralized data warehouses in the first place. 

Regardless of where they are implemented, metrics need a name, a business definition, a technical definition, and an owner. We track these in a simple spreadsheet, but it is so easy to let it gather dust. So, in conversations with Audience members or analysts about metrics definitions, we refer to the spreadsheet. In conversations with engineers about technical implementations, we refer to the spreadsheet. 
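To make that concrete, here is a minimal sketch of what one registry entry might look like in code; the field names and the example metric are hypothetical, and a spreadsheet row carries exactly the same information:

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """One entry in the metrics registry: the minimum we track per metric."""
    name: str                  # the name the business uses in conversation
    business_definition: str   # plain-language definition an Audience member would recognize
    technical_definition: str  # how the metric is computed (e.g. the SQL or column logic)
    owner: str                 # the person accountable for keeping the definition current


# Hypothetical example entry
active_members = Metric(
    name="Active Members",
    business_definition="Members with a paid, non-canceled membership as of the report date.",
    technical_definition="count(distinct member_id) where status = 'active' and canceled_at is null",
    owner="Head of Customer Success",
)
```

Whether this lives in code or a spreadsheet matters far less than having a single place everyone refers back to.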

Working through disputes to get to correct, consistently defined metrics is essential to the work we do in creating data platforms. Impasses in these disputes mean either 1) we are actually referring to two different metrics, or 2) the business needs to get consistent on terminology and operations.

We are considering versioning metrics definitions, and giving them a simple lifecycle like proposed, in progress, and delivered. We have not acted on this yet, but it would help clear up confusion as metrics evolve. Startups like Transform and Supergrain are attempting to address this, but they are very early. I have high hopes!
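We have not settled on any implementation, but as a rough sketch of the idea, versioned metric definitions with the lifecycle states mentioned above could look something like this (everything beyond those three states is a hypothetical illustration):

```python
from dataclasses import dataclass
from enum import Enum


class MetricStatus(Enum):
    PROPOSED = "proposed"
    IN_PROGRESS = "in progress"
    DELIVERED = "delivered"


@dataclass
class MetricVersion:
    metric_name: str
    version: int
    status: MetricStatus
    definition: str  # the agreed definition at this version


# Old versions stay around for reference as the definition evolves.
history = [
    MetricVersion("Active Members", 1, MetricStatus.DELIVERED,
                  "Members with a paid membership as of the report date."),
    MetricVersion("Active Members", 2, MetricStatus.PROPOSED,
                  "Members with a paid, non-canceled membership as of the report date."),
]
```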

See how the Delivery & Metrics layer fits into the broader Progressive Analytics framework by exploring the other articles in this series.



Intro to Progressive Analytics #2: Sources and Pipelines

Data is more siloed than ever before. Three quarters of companies run their business on all or mostly SaaS technologies, and the average mid-sized company uses dozens of SaaS applications. Add to that internal databases and all the spreadsheets used to make decisions. To say data is siloed is an understatement.

Most companies struggle with analytics, unsurprisingly. Data scattered everywhere makes it easy to get disorganized quickly. 80% of analytics efforts never deliver business value, and a big reason is the complexity of so many sources. 

Many commercial and open source tools have emerged to extract data from these sources. The largest claim to be able to extract data from over 200 SaaS products. Meanwhile, estimates for the total number of SaaS products on the market start at 10,000 and go up.

Progressive Analytics is a framework that strives to achieve value from analytics quickly and sustainably. It is designed on principles of agile and lean, tailored to the modern data architecture; I discussed the benefits of following a simple, proven framework in an earlier article. In this post we talk about the bottom layer, Pipelines & Sources.

Layer Four: Pipelines & Sources

A Data Pipeline simply refers to something that moves data from one place to another. Many ways exist to implement a pipeline, some offering more flexibility, reliability, or observability than others.

A Data Source is a convenient, natural hierarchy to precisely refer to where data is coming from. In Progressive Analytics we use this hierarchy:

  • Provider: The organization that owns or provides the data to your organization. Example: Google
  • System: The product, API, or system that provides the data. Examples: Google Ads, Google Analytics, Google Sheets
  • Feed: Most often, systems provide multiple groupings of data. APIs provide separate endpoints for various resources, databases have separate tables, streams have topics, etc. Examples: Google Ads Account, Google Ads Campaign, Google Ads Ad Group, Google Ads Ad
  • Object: Whether batch or streamed, individual data records are always grouped/aggregated and stored in some sort of file or object after they are extracted and processed. Over time, a Feed will have more than one Object. Examples: ad_extract_20210801.csv, campaign_extract_20210801.csv
  • Record: A group of pre-determined fields. Although the structure may vary, typically it is fixed or mostly known in advance. Examples: an individual row in a CSV file, an individual record in a JSON file
  • Field: An individual attribute or property of a record. A group of fields are what make up a record. Examples: ad.status, ad.label, ad.ID, ad.title

These levels provide a nice abstraction that applies to any data from any source, and a simple way to understand the amount of work. 
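To illustrate how the levels nest, here is a minimal sketch; the class and field names are hypothetical, and a spreadsheet captures the same hierarchy just as well:

```python
from dataclasses import dataclass, field


@dataclass
class Feed:
    name: str                                          # e.g. "Campaign"
    objects: list[str] = field(default_factory=list)   # extracted files, e.g. "campaign_extract_20210801.csv"
    fields: list[str] = field(default_factory=list)    # e.g. "campaign.id", "campaign.status"


@dataclass
class System:
    name: str                                          # e.g. "Google Ads"
    feeds: list[Feed] = field(default_factory=list)


@dataclass
class Provider:
    name: str                                          # e.g. "Google"
    systems: list[System] = field(default_factory=list)


google = Provider("Google", systems=[
    System("Google Ads", feeds=[
        Feed("Campaign", fields=["campaign.id", "campaign.status", "campaign.name"]),
        Feed("Ad", fields=["ad.ID", "ad.status", "ad.label", "ad.title"]),
    ]),
])
```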

A sidenote on structured vs. unstructured vs. semi-structured data: most business analytics are driven by structured data, and most operational systems and SaaS products can provide structured data. Progressive Analytics (like most analytics efforts) assumes you are dealing with structured or mostly structured data, because a human can reason about and analyze that data.

Whiteboard vs. Keyboard

In an earlier post I discussed the top layer, Audiences & Questions. It is important that scope and outcomes from this layer drive the Pipelines & Sources layer from the beginning, and that both workstreams happen in parallel.

Unfortunately, high-level, strategic conversations about an analytics effort are too far removed from the actual technical implementation. We call this whiteboard vs keyboard. Business context and nuance that are well understood during planning conversations are too often lost by the time the engineer sits down to configure a data pipeline or transformation. 

What is clear in a whiteboard sketch is easy to lose sight of at the keyboard, because the two operate at such different levels of detail.

Aligning the activities of the whiteboard and keyboard is why we advocate two parallel workstreams–one top-down, and the other bottom-up. Additionally, thin-slicing the amount of work to be done allows anyone, regardless of their level of operation, to keep all the relevant context and goals in mind.

When the conversation leads to adding just one more provider, feed, or even field, you must push hard to defer that work to a future iteration unless absolutely necessary.

Getting Organized

Below are the activities in a typical iteration. 

Scope

The data sources in scope for any iteration should be only the data necessary to answer the in-scope questions identified in the top-down workstream. Defer all other providers, systems, feeds, etc until later.

Inventory

A complete inventory includes going through the levels Provider, System, and Feed.
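A lightweight inventory, down to the Feed level, can be as simple as a nested structure like the hypothetical one below (the providers and feeds shown are placeholders); anything not listed is explicitly out of scope for the iteration:

```python
# One iteration's source inventory: Provider -> System -> Feeds in scope.
inventory = {
    "Google": {
        "Google Ads": ["Campaign", "Ad Group"],
    },
    "Stripe": {
        "Stripe API": ["Charges"],
    },
}

for provider, systems in inventory.items():
    for system, feeds in systems.items():
        print(f"{provider} / {system}: {', '.join(feeds)}")
```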

Access

Gaining access is a technical exercise: creating the API keys, firewall rules, or service accounts necessary to get at the data.

Flow

Above all else in the bottom-up workstream, get the data flowing into your data lake. Many pitfalls and gotchas might exist connecting to the source System and extracting data from it. Until the data is flowing from the source System into a data lake or some other storage location under your control, you have no way of knowing what stumbling blocks you may run into.

Remember that in the ELT architecture, no transformation happens in this step. This keeps things simple and avoids adding yet another complexity to an already complicated activity.
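Here is a minimal sketch of that extract-and-land step, assuming a hypothetical REST endpoint and a local landing folder standing in for the data lake; the point is that the payload is written exactly as received, with no transformation:

```python
import json
import pathlib
from datetime import datetime, timezone

import requests  # assumes the requests package is available


def extract_and_land(endpoint: str, landing_dir: str) -> pathlib.Path:
    """Pull raw data from a source System and land it untouched."""
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()

    # Name the Object with a timestamp so each run lands a new file.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    out_path = pathlib.Path(landing_dir) / f"extract_{stamp}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # No transformation: write the payload exactly as the source returned it.
    out_path.write_text(json.dumps(response.json()))
    return out_path


# Hypothetical usage:
# extract_and_land("https://api.example.com/v1/campaigns", "data-lake/google_ads/campaign")
```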

Assess

Many exercises and tools exist to analyze, profile, and understand the characteristics of data. Some analysis is appropriate directly in the source System, such as understanding what Objects the System provides and what Fields are available in each Object. Most analysis is appropriate after the data is under your control, for example (a profiling sketch follows this list):

  • Do we have all the Fields we need to answer the Questions?
  • Is the data complete or are there blank or null values?
  • Are the patterns in the data expected?
  • How prevalent are anomalies in the data?
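A minimal profiling sketch along these lines, assuming pandas and a hypothetical extract file from the Flow step; dedicated profiling tools can go much deeper:

```python
import pandas as pd  # assumes pandas is installed

# Hypothetical Object landed during the Flow step.
df = pd.read_csv("data-lake/google_ads/campaign/campaign_extract_20210801.csv")

# Do we have all the Fields we need to answer the Questions?
required_fields = {"campaign_id", "status", "spend", "date"}
print("Missing fields:", required_fields - set(df.columns))

# Is the data complete, or are there blank or null values?
print(df.isna().mean().sort_values(ascending=False))

# Are the patterns in the data expected? Spot-check a distribution.
print(df["status"].value_counts(dropna=False))

# How prevalent are anomalies? Example check: spend should never be negative.
print("Rows with negative spend:", (df["spend"] < 0).sum())
```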

Iterate

Future iterations (driven by new Questions or Metrics!) will entail adding Fields, Feeds, even new Providers, etc. By defining an iteration in terms of the Data Source hierarchy, you will be very clear on how much work is to be done and how long it will take.

Pipelines

A data Pipeline refers to the operations that move data. It is an automated, executable process that can be scheduled or triggered. Many techniques exist: homegrown, commercial products, and open source tools. Broadly, any system dealing with data pipelines will entail technology to cover the connections to the source systems, the actual movement of data, the storage once data is under your control, scheduling, and observability. 
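As a rough sketch of how those responsibilities separate, here is a minimal run wrapper with basic observability and retries; scheduling would come from cron or an orchestrator, and every name here is hypothetical:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_pipeline(extract_fn, landing_dir: str, retries: int = 3) -> None:
    """One pipeline run: move data from a source System into storage, with basic observability."""
    for attempt in range(1, retries + 1):
        try:
            started = time.time()
            out_path = extract_fn(landing_dir)  # connection + movement + storage
            log.info("Landed %s in %.1fs", out_path, time.time() - started)
            return
        except Exception:
            # Observability: surface failures loudly, then back off and retry.
            log.exception("Attempt %d of %d failed", attempt, retries)
            time.sleep(2 ** attempt)
    raise RuntimeError("Pipeline run failed after all retries")
```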

The scope of this post is not to analyze the technical options, but to acknowledge that this work is organized and performed at the lowest level of the Progressive Analytics framework, Sources & Pipelines. Datateer provides a Managed Analytics Platform, where all this is pre-built, packaged, and ready to go from day one. If you are starting from scratch, you may have some foundational work before diving into Progressive Analytics such as architecture design, process definition, vendor selection, and integration among tools.

See how Sources & Pipelines fit into the broader Progressive Analytics framework by exploring the other articles in this series.


Intro to Progressive Analytics #1: Audiences and Questions

We have been working on formalizing our approach at Datateer. From our very first customers, the feedback we received was that we were doing things quite differently from what people had traditionally experienced in analytics efforts. My background in software engineering and building products shaped how we operate. Many principles of agile and lean are naturally baked into the framework we follow.

We want to use data to answer questions, quickly and reliably. This is the power of data–making or informing decisions, faster and more accurately. By following a framework, we move fast, deliver iteratively, and keep simple things simple, so we can focus energy on the hard things.

This article introduces some of the concepts we use, starting with the four layers. Every activity we do fits into one of these layers.

Layer One: Audiences and Questions

When onboarding or starting a new analytics platform, we work top-down and bottom-up at the same time. This results in two simultaneous work streams that meet in the middle.

Starting at the top, we have to understand the business we work in. But we do not always have time to become experts in the business! We approach this through guided conversations or workshops.

Before building a list of Questions, we start by discussing Audiences and Subject Areas.

Foundation

Audiences are groups of people who have similar questions that they need to answer. Often departments in the business function as audiences. But not always. SaaS products often cater to multiple personas, and those personas can function as audiences. 

Subject Areas are just high-level, logical breakdowns of areas of the business. We do not get too pedantic here. The Subject Areas need to make sense to the customer and align with how they perceive their business processes. One customer might have a single Subject Area named Customer Journey; another might have Subject Areas of Marketing, Product, and Customer Success that cover similar areas of knowledge.

Build a list

With Audiences and Subject Areas laid out, we can focus on a list of Questions.

Asking “What do you need to know?” along with follow-up questions like “Why is that important?” draws out a lot of information. Our average customer has several years of experience in their business or industry, and most have opinions and ideas. Plus, by avoiding technical details, we are operating in their area of strength. These workshops are easy and fun.

We don’t worry about being precise on the granularity of those questions on the first pass. For example, you may get specific questions from your stakeholders, such as “How much revenue did we make in product line X this week compared to the same week last year?” Or you may get very general questions. Don’t be too prescriptive. Just make a list using prompts like the following (a sketch of a backlog entry follows this list):

  • What questions need to be answered?
  • Why are those questions important?
  • Who asks those questions (which Audience)?
  • How do they use the answers?
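One way to capture each backlog entry; the field names are hypothetical, and a shared document or spreadsheet works just as well:

```python
from dataclasses import dataclass


@dataclass
class Question:
    """One entry in the question backlog produced by the workshop."""
    question: str       # what needs to be answered
    why_important: str  # why it matters
    audience: str       # who asks it
    how_used: str       # how the answer informs a decision


backlog = [
    Question(
        question="How much revenue did we make in product line X this week vs. the same week last year?",
        why_important="Weekly revenue trends drive marketing and inventory decisions.",
        audience="Executive team",
        how_used="Reviewed in the weekly operations meeting.",
    ),
]
```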

Prioritize the list, draw a line

Once we have a list, we sort it by priority and draw a line. If we have a long list, we sort the top 10-15 items and leave the rest unsorted at the bottom.

Then we draw a line. Everything above the line gets attention and is part of the first iteration’s scope. Everything below the line has to wait. Completely. No partial efforts or “go ahead because we will need it anyway.” 

The line should be high up on the list. For the first iteration, even a single question is not too short a list. Embracing this focus allows us to move quickly and build momentum. By putting all our energy into a smaller scope, we move farther and faster than if we diffuse our energy in many directions.

The result of committing to that first question (or small set of questions) is exciting. Often our stakeholders have been waiting a long time to stop spinning their wheels and start moving. This is that moment. 

Conclusion

Starting at the top and focusing on business priorities keeps us from spinning our wheels trying to understand all the data and all the systems that provide data. 

By creating and prioritizing a list, we avoid the common pitfall of trying to do too much at once. We also effectively have created a roadmap. The list of questions functions as a backlog of work that needs to be done, and the business stakeholders are in control of the work priority from day one. 

Because we restrict scope, we create short iterations that give stakeholders a chance to evaluate and adjust their priorities. 

When we start at the top, prioritize the list, and restrict scope, we have already created a situation with a much higher likelihood of success.

More to come in upcoming posts, where I will discuss the other three layers.



Analytics in Weeks (not Months or Years!)

Most business leaders recognize the latent value of data. Yet many people think it takes 12-18 months to implement a cloud data warehouse. Why is that perspective so common? 

“Warehouse” sounds big and intimidating. “Cloud” sounds new and fraught with risk as well as possibility. And there is no shortage of consultants and product vendors promising that their offering is the silver bullet to make all this easy.

Data can be complex. But the last couple of years have seen many innovations to accelerate and simplify this process, with no sign of slowing down.

Typical Scenario

The accelerated adoption of SaaS tools and products has caused business data to be scattered across many different silos. Growth of the SaaS industry is 17% year over year and accelerating. The average mid-sized company uses 30% more SaaS apps than they did last year, averaging 137 SaaS products used per company. That is 137 data silos. When data is spread out like that, it is easy to see why people think bringing all that data together for analysis is next to impossible.

Most companies have proprietary data sources, such as application databases, that are also part of the mix. 

With this much complexity, how do you even begin to leverage data to answer questions about business operations and the customer journey? 

Data is 15 years behind

A few years ago when I dove into data analytics, I came from a software background. The innovation in software tools and processes was truly impressive at the time (and still is). Stepping into the data world, I felt like I stepped back in time. Everything was done as if these new innovations did not exist. Source control, automated tests, continuous integration–all were foreign concepts.

Because the tooling and processes for working with data were so archaic, the results were abysmal. Adding new data sources took weeks or months. Troubleshooting issues took deep expertise, not just in data but in the specific implementation itself. Because changes took so long, the cost of mistakes was high, which only lengthened analysis time and the time to deliver anything.

People and companies were stuck spinning their wheels, just waiting on results.

It is no wonder many people still feel like analytics efforts are beyond their grasp–historically, these efforts have been expensive, slow, and risky.

Innovations and the Modern Data Stack

Today, analytics tooling and processes are rapidly catching up, faster than I ever thought possible. The challengers are open source companies and companies built on cloud delivery models. A16z did a great job capturing this movement.

Generally, analytics platforms are visualized left to right, in an ETL pattern (extract, transform, load).

A key switch in pattern is moving from ETL (extract-transform-load) to ELT (extract-load-transform). Although seemingly trivial, moving the data transformations to the end rather than in the middle allows better architectural principles, such as breaking down a complex system into smaller components. 

For example, companies like XPlenty, Fivetran, and Stitch (and open source tools like Meltano, Airbyte, and Singer) can focus on extracting the data and loading it to a data lake–without attempting to modify the data. Removing the transformation responsibility greatly simplifies the problem, allowing them to focus.
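As a rough sketch of what that separation buys, each stage becomes a small, independent component; the function stubs below are purely illustrative:

```python
def extract(source: str) -> list[dict]:
    """Connect to a source System and pull raw records, with no reshaping."""
    return []  # stub


def load(records: list[dict], landing_path: str) -> None:
    """Write raw records to the data lake or warehouse exactly as received."""
    ...


def transform(landing_path: str) -> None:
    """Later, and independently: reshape raw data into analytics-ready models."""
    ...


# ELT: extract and load are one simple, reusable step;
# transform is a separate concern that can evolve on its own schedule.
load(extract("google_ads"), "raw/google_ads")
transform("raw/google_ads")
```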

At the center of the movement is dbt (data build tool, from dbt Labs), which deals with transforming raw data into analytics-ready data models capable of answering questions. Simple innovations like introducing the software engineering principles of source control and templating are driving fast adoption of dbt.
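dbt embeds Jinja templating directly in SQL files; the Python sketch below is not dbt code, just an illustration of the templating idea, and the table and column names are hypothetical. A templated transformation like this can live in source control like any other code:

```python
from jinja2 import Template  # assumes jinja2 is installed

# A reusable, templated transformation: one pattern, many tables.
revenue_by_day = Template("""
select
    date_trunc('day', {{ timestamp_column }}) as day,
    sum(amount) as revenue
from {{ source_table }}
group by 1
""")

print(revenue_by_day.render(timestamp_column="charged_at", source_table="raw.stripe_charges"))
```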

Behind all of this is the cloud, particularly cheap storage and specialized analytical databases, such as Snowflake, Redshift, BigQuery, and Rockset.

With those major components in place, other specialized tools are appearing for things such as workflow and scheduling (Airflow, Prefect, Dagster), data quality analysis (Great Expectations, Soda SQL), and data operations monitoring (Monte Carlo).

While none of these are quite plug and play, they follow the modern data stack architecture and integrate much better with each other than the tools from the past.

Progressive Analytics

Tools are one thing, but improved processes are just as important. Applying agile principles to data efforts can turn what used to be a slow, painful process into one that delivers insights early and often. 

In my experience, most people’s mental models are slower to change than the innovations in tools and products. Most organizations still approach analytics efforts as a monolithic project. Consulting companies and IT teams continue to love big projects (that are prone to fail by the way) because a big budget creates perceived stability and stature. 

But analytics is a journey and follows a maturation process in any organization. At Datateer, we use a framework we call Progressive Analytics. It simply applies principles of agile (iterative delivery, reduce WIP) and lean (design for flow and pull, reduce waste) to deliver value early and often. This value is defined as using data to answer questions and assist decision making.

The first and most important principle is to think right to left, instead of left to right. Consider a simple visualization of an analytics system, where data flows from various sources, is transformed into a data warehouse model, and is used to answer questions through dashboards or other analyses.

Most people implementing analytics think first about the data sources and how to move the data into the warehouse, then how to transform and analyze. This is left-to-right thinking, or “push” based design in lean terms. This thinking makes teams feel like they have to finish all the raw data loaders before they can move to cleaning or even think about combining data sources and transforming them, much less building dashboards.

Instead, think in terms of the questions that need answering, the audiences asking those questions, and the subject areas they pertain to. Then move to the left to design the data model only enough to answer those questions. By focusing on only a handful of questions, the amount of data modeling required is much smaller.

Then move left again to only pull in the data you need to supply the data model, to answer the questions.

By doing this, you can start using the analytics in short order. By focusing only on the required portions of the data model, and loading only the required fields from each source, velocity is much higher, and you can iterate to expand the number of analyses and dashboards, and the number of data sources involved.

A business may need answers to hundreds of questions from its analytics platform. Taking on all these questions at once is a recipe for long projects and high risk. By starting with the questions that need answering, prioritizing and iterating become possibilities. 

Delivering the answers to a few critical questions in a few days is far superior to trying to answer all the questions a year from now. Once those first questions are answered, the flywheel has started, and moving on to the next-highest-priority questions can follow the iterative, progressive patterns now established.

In our experience, following the Progressive Analytics framework produces results in weeks. Thanks to the innovations in tooling and products, creating iterative processes to deliver analytics is now a possibility for any organization.
