How can you design a data analytics platform that follows best practices, yet remains flexible to your unique needs? It can be a daunting task.
In this article, I’ll break down a reference architecture, with a special focus on the needs of credit unions. Here is how a reference architecture can help:
- Learn from the many iterations of others
- Get to business outcomes sooner
- Build and operate more easily
- Get smoother communication from a clear picture
Designing a data analytics platform for credit unions can be a complex task, requiring a deep understanding of the organization's needs and priorities. Having a reference architecture helps. It provides you with a solid foundation for designing a data analytics platform. It creates a standardized approach to data management, data governance, and data usage. This makes it easier to design, build, and maintain a data analytics platform, ensuring that it meets your specific needs.
By having a well-designed data analytics platform, credit unions can improve their business outcomes in several ways. They can leverage data to generate higher ROI on marketing activities, improve member experience by offering personalized products and services, and increase operational efficiency by automating processes. These benefits can help credit unions remain competitive in the financial industry and provide value to members.
This reference architecture is derived from Datateer’s Managed Analytics service, which powers data analytics for all of our customers. It is the result of years of refinement, hardened in the messy real world.
In a later article, I will address applying the Simpler Analytics framework to your data analytics operations. That framework focuses on process and organization, and is complementary to the reference architecture described in this article.
Let’s dive in.
Visualizing your modern data analytics platform
First, let’s visualize a mature architecture. But you should not attempt to build everything at once! Further down, I will break this down into an approachable roadmap.
Below is a diagram of a mature system, emphasizing the flow of data from raw data sources to audiences that consume analytics.
Here is a breakdown of each component, why it matters, and examples of vendors that provide products or tools for each.
The general outline is:
- A data warehouse as a centralized place for metrics, with organization-wide definitions and calculations
- Sometimes, a data lake to simplify ingestion and aid exploratory efforts
- Ingestion tooling and processes to gather siloed and scattered data from throughout the organization
- Transformation technology to combine data from multiple sources and calculate metrics and KPIs
- Orchestration tooling and processes to schedule and automate
- Data governance tooling and processes to ensure regulatory compliance and data privacy
- Automated data quality measurement to ensure the integrity and accuracy of data
- Business intelligence to enable exploration and self-service
- Data activation tooling to make metrics and Member 360 data available in other systems like CRM and online banking systems
- Data operations (aka “DataOps”) for when things go wrong (and they will)
- Infrastructure management to ensure high availability, reliability, and cost control
A data warehouse is a database. The marketing messages about innovation are true, and a lot of innovation is happening. But in the end, a warehouse is a place to store data, with a SQL query interface on top.
The warehouse is the core of everything in your data analytics system. Leading vendors are Snowflake and Databricks. Google BigQuery, Amazon Redshift, and Microsoft Synapse are strong offerings tied to their cloud platforms.
A data lake is a place to dump data, a centralized place where it is easy to collect data from operational systems. The key challenge with ingesting data is finding the right balance between:
- The ease of getting the data out of the data source and into a central location
- Getting from unstructured or semistructured data into a known schema you can use in analytics
A data lake is the bridge between ingestion and structured analytics. At Datateer, we use a simple setup of storage buckets in AWS or GCP. An alternative is to designate a segment of a data warehouse as the lake. Snowflake and Databricks both are pushing towards this pattern (in part because it increases your reliance on their products).
The lake handles what we call “preprocessing,” which is basic cleansing and technical checks.
But the lake should not apply any business logic. The data should be structured and accessible as it arrived from the data source.
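To make the boundary concrete, here is a minimal sketch of what preprocessing might look like, assuming the extract arrives as CSV text. The function name `preprocess` and the column names are hypothetical, chosen only for illustration. It performs technical checks (required columns present, rows match the header) and basic cleansing (trimming whitespace, dropping blank rows), and deliberately applies no business logic.

```python
import csv
import io


def preprocess(raw_csv: str, required_columns: list[str]) -> list[dict[str, str]]:
    """Technical checks and basic cleansing only; no business logic."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    header = [name.strip() for name in (reader.fieldnames or [])]

    # Technical check: the extract must contain the columns we expect.
    missing = [col for col in required_columns if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    cleaned = []
    for row in reader:
        # Technical check: DictReader stores overflow fields under the key None.
        if None in row:
            raise ValueError("row has more fields than the header")
        # Basic cleansing: trim whitespace; short rows get empty strings.
        record = {key.strip(): (value or "").strip() for key, value in row.items()}
        # Drop rows that are entirely blank.
        if any(record.values()):
            cleaned.append(record)
    return cleaned
```

Values stay as text and keep their source names, so the data remains recognizable as what the source system sent.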
Ingestion, or data replication, is all the processes involved in extracting data from data sources like operational systems, SaaS APIs, application databases, etc. The goal is getting data from all areas of your business into a central location so that the data can be analyzed collectively.
The most important part of the ingestion component is deceptively simple. Modern data analytics follows an extract-load-transform pattern (ELT), not the older extract-transform-load pattern (ETL). By waiting until after data is extracted and loaded (EL) before performing transformation (T), you remove many problems that have long plagued data analytics efforts.
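The ELT ordering can be sketched in a few lines. This is an illustration only, using Python's built-in `sqlite3` as a stand-in for a real warehouse. The table names, column names, and sample rows are all hypothetical. The point is the sequence: the raw extract lands unchanged, and transformation happens afterward, inside the warehouse, in SQL.

```python
import sqlite3

# Hypothetical raw extract from a core system; every value is still text.
raw_accounts = [
    ("101", "savings", "250.00"),
    ("101", "checking", "75.50"),
    ("102", "savings", "1000.00"),
]


def load_raw(conn, rows):
    """EL: land the extract exactly as it arrived."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_accounts "
        "(member_id TEXT, account_type TEXT, balance TEXT)"
    )
    conn.executemany("INSERT INTO raw_accounts VALUES (?, ?, ?)", rows)


def transform(conn):
    """T: build an analytical table from the raw data, inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS member_balances")
    conn.execute(
        """
        CREATE TABLE member_balances AS
        SELECT member_id, SUM(CAST(balance AS REAL)) AS total_balance
        FROM raw_accounts
        GROUP BY member_id
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load_raw(conn, raw_accounts)
transform(conn)
result = conn.execute(
    "SELECT member_id, total_balance FROM member_balances ORDER BY member_id"
).fetchall()
```

Because the raw table is kept, a changed metric definition only requires rerunning `transform`, not extracting the data again.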
At Datateer, we have found it helpful to define a set number of extraction strategies or patterns. When we are dealing with a new data source or trying to understand an existing one better, it helps us to know which extraction pattern is applied. Here are some examples:
- Push - the data source (or the team that manages it) is responsible for pushing extracts into the lake. Variants of this strategy might be an ERP that pushes scheduled reports or CSV files that an external system deposits in the lake
- Pull - the data source provides something like an API or database connection that allows us to extract data. These are often scheduled extractions. Variants of this pattern include applying incremental extracts or change data capture (CDC) patterns.
- Streaming - for truly real-time needs, data can be streamed or be pushed to the lake in extremely small batches.
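As one illustration of the pull pattern with incremental extracts, here is a minimal sketch. It assumes each source record carries an `updated_at` timestamp in ISO 8601 text form, which sorts correctly as a string. The function name `incremental_pull` and the field names are hypothetical.

```python
def incremental_pull(source_rows, last_cursor):
    """Return the rows changed since last_cursor, plus the new cursor.

    last_cursor is the highest updated_at value seen by the previous run.
    """
    new_rows = [row for row in source_rows if row["updated_at"] > last_cursor]
    # If nothing changed, keep the old cursor so the next run starts from the same point.
    new_cursor = max((row["updated_at"] for row in new_rows), default=last_cursor)
    return new_rows, new_cursor
```

Each scheduled run stores the returned cursor and passes it to the next run, so only new or changed records are extracted. A real source would apply the filter in its API or query rather than in memory.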
In the past 5 years, specialized vendors have created many prebuilt extractors. The cost and reliability of these mean custom-built extractors should be a rare case. At Datateer, we only custom-build about 15% of data extractors. Credit unions, however, have a number of niche systems like core systems and loan origination systems. You may need to build 40-50% of your extractors.
We have had positive experiences relying on extractors from Matillion, Portable, and Fivetran. For custom-built extractors, we find success with Meltano and Airbyte.
Transformation refers to combining, aggregating, normalizing, and performing calculations. It’s all the code and logic to get data from its raw state into an analytical data model designed for reporting. The leading solutions in modern data analytics are Matillion and dbt Labs.
With so many moving parts, orchestration is a key component of modern data analytics.
Orchestration includes scheduling of technical jobs, dependency management, and routing status and issues.
Vendors specializing in data pipeline orchestration that we have had positive experiences with include Prefect and Dagster. Airflow is a mainstay, but we found it to be monolithic and outdated for our needs. Products like Matillion provide a more integrated solution, combining capabilities of orchestration and transformation into a single user experience.
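To show what dependency management and status routing mean in practice, here is a minimal sketch built on Python's standard `graphlib` module. It is not how any of the products above work internally. The function name `run_pipeline` and the status strings are assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and report a status for each one.

    tasks: task name -> callable taking no arguments
    dependencies: task name -> set of upstream task names
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    statuses = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(graph).static_order():
        if any(statuses.get(dep) != "success" for dep in graph[name]):
            # Do not run a task whose upstream failed or was skipped.
            statuses[name] = "skipped"
            continue
        try:
            tasks[name]()
            statuses[name] = "success"
        except Exception as exc:
            statuses[name] = f"failed: {exc}"
    return statuses
```

For example, if a CRM ingestion fails, a transformation that depends on it is marked as skipped rather than run against stale data, and the returned statuses can be routed to whoever needs to respond.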
Credit unions must comply with regulatory requirements and data privacy laws. For many credit unions, this is a key risk that slows down innovation. PII and personal financial information are tied to every service a credit union offers.
Governance requires policies and standards about who can see data and how they can use it. It also requires tooling to implement and enforce these policies.
An unfortunate reality of data governance is that it can balloon quickly and get complicated. The innovation it is intended to unlock becomes even more bogged down by an unwieldy data governance process.
ALTR is a data access and control vendor that specializes in the financial services industry. Immuta is also a popular vendor.
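A policy about who can see which data can start out very small. The sketch below is an assumption-laden illustration, not the approach of any vendor named above. The roles, column names, and the `apply_policy` function are hypothetical. It masks every column a role is not explicitly allowed to see, including columns the policy does not mention.

```python
# Hypothetical policy: column name -> roles allowed to see the real value.
POLICY = {
    "ssn": {"compliance"},
    "balance": {"compliance", "analyst"},
    "member_id": {"compliance", "analyst", "marketing"},
}


def apply_policy(row, role, policy=POLICY):
    """Return a copy of row with every column the role may not see masked.

    Columns missing from the policy are masked by default (deny by default).
    """
    return {
        column: (value if role in policy.get(column, set()) else "***")
        for column, value in row.items()
    }
```

Denying by default keeps the policy from silently exposing a new column when a source system adds one.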
One of the biggest killers of any data analytics effort is poor data quality. This often stems from incomplete or inaccurate data from the upstream sources. But it also results from timing issues, calculation errors, or changing business needs that render existing metrics insufficient.
No mainstream standard exists to measure data quality. At Datateer, we like to split it into two separate concepts: fitness for purpose from a business perspective, and technical health.
However you organize, you need tools for automatically measuring quality, troubleshooting, and managing issues and resolutions.
We use a variety of tools at Datateer around data quality. Vendors that provide observability and testing include Metaplane, Monte Carlo, Datafold, Soda SQL, and Qualytics.
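The split between technical health and fitness for purpose can be sketched as two separate sets of checks. This is a simplified illustration and not how the tools above work. The function names, field names, and the business rule (a loan balance should not be negative) are hypothetical.

```python
def technical_health(rows, key, required):
    """Technical checks: data arrived, keys are unique, required fields are populated."""
    issues = []
    if not rows:
        issues.append("no rows loaded")
    keys = [row.get(key) for row in rows]
    if len(keys) != len(set(keys)):
        issues.append(f"duplicate values in {key}")
    for field in required:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        if missing:
            issues.append(f"{field}: {missing} missing")
    return issues


def fitness_for_purpose(rows):
    """Business check (hypothetical rule): a loan balance should not be negative."""
    return [
        f"member {row['member_id']}: negative loan balance"
        for row in rows
        if row.get("loan_balance") is not None and row["loan_balance"] < 0
    ]
```

Keeping the two lists separate makes it easier to route technical issues to the data team and business-rule issues to the people who own the metric.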
Business intelligence is the tip of the iceberg–it is what everyone thinks of when they think of analytics, because it is the most visible piece.
Credit unions need curated, managed, controlled reports and dashboards. If you also provide exploratory tools, you will get much more engagement and build momentum.
Some products, like Sigma Computing, Astrato, and Hex, focus on curated dashboards as well as exploration use cases.
Vendors that focus only on dashboarding and reporting include Tableau, Power BI, and Qlik.
Activation focuses on scenarios outside of business intelligence. It’s a big topic that can be summarized in a few patterns:
- Pushing data back out. Sometimes labeled “reverse ETL,” this means taking your enriched data from the warehouse and pushing it back out to things like the CRM or core system. Vendors that specialize in this include Hightouch and Census. Ingestion vendors like Matillion and Integrate.io are now adding “reverse ETL” capabilities as well.
- Making data discoverable. The more accessible and understandable data sets are to the broader organization, the more traction any data analytics effort will have. This is especially important with people scattered in various branches. Vendors that focus on discoverability and cataloging are Select Star, Castor, and Atlan.
- Member-facing analytics. Getting data out to members, business partners, and regulators can be just as important as making data accessible internally. Many BI tools provide “external embedding” capabilities. Delivery Layer and Datateer’s Embed Portal are two products that enable this.
- Machine learning is a very hot topic. The curated, centralized warehouse is the perfect input into machine learning efforts.
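The first pattern in the list above, pushing data back out, can be sketched as a small sync loop. This is an illustration under stated assumptions: the `send` callable stands in for whatever API the destination system exposes, and `sync_to_crm`, `already_synced`, and the field names are hypothetical.

```python
def sync_to_crm(warehouse_rows, already_synced, send):
    """Push only the rows that are new or changed since the last sync.

    warehouse_rows: enriched rows read from the warehouse
    already_synced: member_id -> the row last sent (updated in place)
    send: callable that delivers one row to the destination system
    """
    sent = []
    for row in warehouse_rows:
        member_id = row["member_id"]
        if already_synced.get(member_id) != row:
            send(row)
            already_synced[member_id] = dict(row)
            sent.append(member_id)
    return sent
```

Sending only changed rows keeps the load on the destination system low, which matters when that system is a CRM or core system with rate limits.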
“DataOps” or Data Operations refers to keeping the system running and improving.
A key situation is when something breaks in the orchestration. With so many moving parts, including upstream dependencies on data sources, things are bound to go wrong.
Requests come in for analyses and help with all kinds of questions around data. Managing and fulfilling these requests is a service a central data team can provide to the broader organization.
Many systems targeted to IT are suitable for data operations. At Datateer, we use Freshworks. The key is integrating the orchestration tool with the operations tool.
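Integrating the orchestration tool with the operations tool can be as simple as opening a ticket whenever a job fails. The sketch below assumes a `create_ticket` callable supplied by you. It stands in for the operations tool's API and is not a real Freshworks call. The function name `run_with_incident` and the ticket fields are hypothetical.

```python
def run_with_incident(job_name, job, create_ticket):
    """Run a pipeline job; if it fails, open a ticket instead of failing silently.

    Returns True when the job succeeded and False when a ticket was opened.
    """
    try:
        job()
        return True
    except Exception as exc:
        create_ticket(
            {
                "title": f"Pipeline job failed: {job_name}",
                "detail": str(exc),
                "severity": "high",
            }
        )
        return False
```

With this in place, a broken upstream dependency becomes a tracked incident with an owner, rather than something discovered later from a stale dashboard.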
Many credit unions have been hesitant to embrace cloud infrastructure. This is changing due to pressure from members and partners, maturing security by cloud providers, cost improvements, and the scalability needs of credit unions.
The decision of which cloud infrastructure provider to use is typically larger than the data architecture planning.
If a credit union is lagging in maturity around cloud adoption and security, a data infrastructure is an excellent opportunity to advance the organization’s capabilities as a whole.
Many data product vendors will support AWS, GCP, and Azure.
Starting simple, maturing over time
Creating this from a blank slate can be difficult. However, I am certain you are not starting from a blank slate! Legacy systems, difficulties in accessing data, and ambiguity of critical entity definitions are all too common in credit unions.
The way to simplify this is to build in layers. Starting with the components that will bring the most immediate value, you can then layer on more capabilities. Thinking in iterations can transition naturally into a roadmap. Below is an example of a series of iterations to do just that:
Level 1: Show me the numbers
Avoid getting bogged down with all the underpinnings and start with a single audience and single analysis use case. This still requires much technical implementation, so make the business scenario as narrow as possible.
In this stage, stand up a warehouse and lake, and a BI tool. Choose two systems to begin ingesting–the core system and a CRM are good places to start. But don’t ingest everything–focus on the data from those systems that support your narrow use case.
Level 2: Automate and protect
Going very far without ensuring governance and security measures are in place puts too much risk on a credit union’s data operations.
In this stage, automate the use case from level 1 with orchestration, and add one or two more use cases. Keep them narrow. Your analyses do not need all the available data at this point, so defer what you can on the ingestions.
Apply a layer of data governance (recognize that this will mature over time).
Level 3: Increase quality and reliability
As you continue to increase supported analyses and data sets, the reliability of the system and quality of the data become critical to continued momentum and success.
Start your data operations by establishing processes to handle requests and incidents.
Implement data quality tooling. Include measurements of technical health as well as accuracy and completeness from a business perspective.
Level 4: Activation
Now that you support a variety of subject areas, consider how to activate data beyond BI and exploratory tools. Consider how to automatically push data into operational processes, member-facing touchpoints, and self-service discovery.
Sign up to read other articles in this series
Congratulations! You are well on your way to a data architecture based on proven practices. This will set up your credit union for success in data analytics and provide a foundation for future growth.
Was this article informative and helpful? Register to get the other articles in this series:
- Modern Data Analytics: a Must-Have for Credit Unions
- Modern Data Analytics in Credit Unions: a Reference Architecture
- The Power of Member 360 for Credit Unions
- Designing Credit Union Member 360s: a Reference Model
- Checklist for Data Governance in Modern Data Analytics for Credit Unions
- Applying the Simpler Analytics framework to Credit Union Data Analytics