Why we changed from a bespoke to an open standard data product specification for our data mesh implementation
By Jethro Kairys and Adric Streatfeild
One of the most important assets we have at ANZ is data. Data is the lifeblood of our business. It is the embodiment of all the value we hold and protect for our customers, and deriving insights from it is both a fundamental part of our regulatory obligations and a source of deep understanding of our business: detecting fraud, identifying market opportunities, predicting trends, pricing risk, and informing long-term plans.
Without data, you can’t run a bank. Without good data, you can’t make the bank better.
Managing data across four customer divisions and some 29 international markets is hard. Banks have historically used centralised warehouses to sweep all their data into one place in the hope that this would help organise it. It turns out that this creates bottlenecks, because it moves long-term responsibility for managing data away from the divisions that create it.
At ANZ, we have moved towards a federated data mesh for the development, operation, and governance of our data systems. This decentralised approach allows us to better unlock the value of analytics at scale, improve our ability to innovate, and deliver better value for stakeholders and customers.
An effective data mesh allows data consumers from anywhere in the organisation to independently find, understand and combine products from the mesh to try new ideas and derive value. This only works if different business units build interoperable products, and those products are well documented and self-describing, allowing data consumers to properly understand the data products they find and to decide whether they are fit for purpose.
Our data mesh implementation will rely on a thorough data product specification to formally declare and define each data product in an interoperable way. The data product specification plays a critical role in ensuring any business can derive value from its data – now, and in the future.
To support our data goals, our data product specification is multi-faceted. Here are some of the things we demand from a data product specification:
· Describe the data product’s intent (and other metadata).
· Establish quality standards, the data format, semantics, and usage guidelines.
· Form the basis for agreements between data producers and consumers.
· Ensure effective communication, coordination, and trust within the data ecosystem.
In our bespoke data product descriptor specification, all of this is wrapped up in a machine-readable product.yaml file.
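To make this more concrete, here is a simplified, hypothetical sketch of what such a descriptor file could look like. The field names below are illustrative only and do not reflect our actual schema; they simply show how intent, quality, format, usage guidelines and producer/consumer expectations can all live in one machine-readable file.

```yaml
# Hypothetical, simplified product.yaml - field names are illustrative only
name: home-loan-applications
version: "1.2"
owner: australia-retail-lending        # producing business unit
description: >
  Daily snapshot of home loan applications, curated for credit risk
  and marketing analytics use cases.
usage:
  intended_use: "Portfolio risk reporting and campaign targeting"
  restrictions: "Not for customer-level decisioning without approval"
outputs:
  - name: applications
    format: parquet
    schema_ref: schemas/applications.json   # reuse an existing schema artefact
quality:
  - check: completeness
    column: application_id
    threshold: 1.0
slo:
  availability: 99.5        # percent, per calendar month
  freshness: "24h"          # data no older than 24 hours
```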
Our implementation is intentionally minimalistic – just enough to get the core job done and to move quickly. We could have made the specification far richer (specifying input/output rules, observability metrics, SLIs/SLOs, semantics, lifecycle considerations), but our primary objective is to prove value quickly. Although the current implementation is incomplete, we have learned a great deal from our experience with it.
There are five overarching principles we’ve settled on as we define our data product specification:
1. Balancing global rules with local flexibility
When defining a data product specification, we’re essentially mandating a common interchange format that all business units must agree to. The sheer volume, variety and velocity of data produced by ANZ across all our environments makes it difficult to set a single universal description of all aspects of a data product. And with ANZ having employed technology for over 50 years, there are multiple systems supporting our data environments too.
Rather than trying to perfectly model all technologies and scenarios, we prefer a data product specification that leverages existing open standards and allows data product creators to construct a valid data product descriptor through a process of composition.
This is especially important when describing data product input and output ports, as the underlying technologies will vary across the organisation. We believe this is in alignment with mesh philosophy - striking the right balance between standardisation and flexibility.
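As an illustration of that compositional approach, a port definition can simply reference artefacts described by existing open standards (such as Avro, JSON Schema or OpenAPI) rather than re-modelling them inside the specification itself. The structure and field names below are a hypothetical sketch, not part of any particular standard.

```yaml
# Hypothetical sketch: ports composed from existing open standards,
# rather than modelled from scratch in the product specification.
input_ports:
  - name: transactions-feed
    technology: kafka
    contract:
      standard: avro                      # schema described with Apache Avro
      schema_ref: contracts/transactions.avsc
output_ports:
  - name: curated-transactions
    technology: bigquery
    contract:
      standard: json-schema               # table contract described with JSON Schema
      schema_ref: contracts/curated_transactions.schema.json
  - name: transactions-api
    technology: http
    contract:
      standard: openapi                   # API port described with an OpenAPI document
      spec_ref: contracts/transactions-api.openapi.yaml
```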
2. Not letting perfect be the enemy of good
In a perfect world, our data product specification and data mesh implementation would enforce the rigour of software engineering best practices and the software engineering lifecycle across our data products - and indeed that is our eventual aspirational target.
But we need to remember that data product consumers, and even producers, aren’t always engineers.
Our initial implementation borrowed heavily from software conventions such as full semantic versioning and the deployment of immutable build artifacts. However, we’ve found that in many cases - especially during early data product development – this kind of rigidity has made for an overcomplicated user experience. Right now we need adoption more than we need perfection.
So, as one example, we’re adopting a simpler versioning strategy: as we decompose monolithic data warehouses and lakes into smaller, use-case-aligned data products, it is unlikely that a given product will need the full extent of the major.minor.patch versioning scheme we’re all accustomed to. We just need some way of preventing breaking changes - or at least communicating them - so we decided it would be acceptable to allow major.minor versioning as well.
It is also worth noting that we considered calendar versioning but decided semantic versioning would be more appropriate as we prefer to allow data products to consume newer versions of their dependencies and provide a clear indication when a new version contains breaking changes.
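A hedged sketch of how major.minor versioning might surface in a descriptor: the major number is reserved for breaking changes, the minor number for backwards-compatible additions, and patch-level detail is deliberately left out. The fields and entries below are illustrative only.

```yaml
# Hypothetical sketch: major.minor versioning in a product descriptor.
# Bumping the major number signals a breaking change to consumers;
# bumping the minor number signals a backwards-compatible addition.
version: "2.1"
compatibility:
  breaking_since: "2.0"      # consumers still on 1.x must migrate explicitly
changelog:
  - version: "2.1"
    change: "Added optional channel column to the applications output"
  - version: "2.0"
    change: "Renamed cust_id to customer_id (breaking)"
  - version: "1.4"
    change: "Final 1.x release; patch-level fixes handled by the platform, not the spec"
```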
3. Observability and service level indicators and objectives
In software and cloud engineering, the process of measuring performance at the system, application and business level is a largely solved problem. Observability tooling has evolved to comfortably support the operation of cloud native applications with reliability, performance, and quality built in.
With data products, there is no one-size-fits-all solution to observability. Different areas of the organisation have their own way of operating and monitoring data products, so there’s no sense mandating that all products expose observability metrics in a specific platform or tool from day one.
We’ve instead started with the idea of standardising on a basic set of service level objectives - currently availability and freshness - while providing flexibility in how these objectives and their indicators are calculated.
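Sketched with hypothetical field names, that could look something like the snippet below: the objectives and targets are standardised across the mesh, while each producing team describes how its indicator is actually measured.

```yaml
# Hypothetical sketch: standard objectives, locally defined indicators
slos:
  - objective: availability
    target: 99.5                 # percent of successful reads per month
    indicator:
      measured_by: "platform query success rate"      # each team chooses its own source
  - objective: freshness
    target: "24h"                # data must be no older than 24 hours
    indicator:
      measured_by: "max event timestamp vs. wall clock at 06:00 daily"
```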
4. Schema and semantics – what is the source of truth?
When describing a dataset, where is the ‘truthiest’ place to store the schema and semantics of the dataset? Is it:
· In the data product specification that describes the intent and result of data modelling?
· In the data modelling itself? For example, in a dbt models.yml file.
· In the dataset itself once deployed?
We have a strong preference to keep this definition as close to the ‘source’ as possible, i.e., the models or the metadata that describes them, rather than force data product creators to ‘double handle’ metadata by replicating it across multiple files.
We’re favouring a solution that embeds metadata into source code, ideally with the ability to collect additional/extensible metadata via a user-friendly IDE or UI. Importantly, we’ve used this same approach with other disciplines (such as embedding metadata and specifications for APIs alongside the source code of APIs).
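As an illustration of keeping metadata close to the source, a dbt properties file of the kind mentioned above can carry descriptions, ownership tags and tests alongside the model definitions themselves, from which a descriptor could later be generated. The model and column names here are purely illustrative.

```yaml
# Illustrative dbt properties file (e.g. models.yml) with metadata embedded at the source
version: 2

models:
  - name: applications
    description: "Curated home loan applications (illustrative model)"
    meta:
      owner: australia-retail-lending
      data_product: home-loan-applications
    columns:
      - name: application_id
        description: "Unique identifier for a loan application"
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "Foreign key to the customer dimension"
```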
5. Separating code and configuration
Another software engineering principle we borrow heavily from is the concept of separating code from configuration. Doing this allows cloud native applications to be built once and deployed many times, typically into various environments, before reaching production.
This concept is also relevant to our choice of data product specification, as the specification may include aspects of code, configuration, and runtime metadata. For example, a product’s inputs need to be defined during the development and build stages, and their values will likely change depending on which environment the product is deployed to. For outputs, the value may refer to a resource (such as a BigQuery Dataset or Pub/Sub topic) that is not known until after the product is deployed, i.e., a runtime value.
With our existing implementation, the line between code, configuration and runtime data has become blurred. A good data product specification draws clear distinctions between these aspects of the product’s metadata, making the process of developing, deploying, and using data products as seamless as possible.
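A hedged sketch of what that separation might look like: the descriptor (code) declares logical ports, per-environment configuration supplies concrete bindings at deployment time, and runtime values are resolved and published only after deployment. The file names, project identifiers and fields below are illustrative.

```yaml
# product.yaml (code): logical, environment-agnostic declaration
outputs:
  - name: curated-transactions
    type: bigquery-dataset
---
# config/prod.yaml (configuration): concrete bindings chosen at deployment time
curated-transactions:
  project: anz-data-prod            # illustrative project id
  dataset: curated_transactions
---
# runtime metadata: resolved only after deployment, published back to the catalogue
curated-transactions:
  resource: "projects/anz-data-prod/datasets/curated_transactions"
  created_at: "2024-05-01T02:15:00Z"
```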
Reasons for our change of approach
Our bespoke data product descriptor specification has served us well so far, but we’ve decided that to scale our mesh we would be better off adopting an open standard data product specification. That’s because:
· It will easily address the missing functionality in the existing implementation.
· We now have a better idea of what we require based on the work so far.
· There will be better compatibility across different parts of the business.
We are confident this new approach will put us in a better place from which to expand the mesh, helping ANZ take the next step on our data mesh journey.
Assessing potential standard data product specifications
We are now exploring using a standard data product specification instead of our original bespoke one. There are several possible data product specifications we could use that are both open-source and technology-agnostic.
Stay tuned for our next article where we explain the assessment we conducted and the standard that we chose to use across the 40,000 people in ANZ.
Jethro is a Staff Data Engineer at ANZ. He is a hands-on solution architect with 10+ years of experience in data, software engineering, and DevOps. He is a Python and PySpark expert with proficiency in tools like dbt, Databricks, React, TypeScript, and Kubernetes. He is passionate about cloud technologies, CI/CD, GitOps, and test-driven development, and believes in keeping things simple, shipping early, and iterating based on feedback.
Adric is a Data Engineer Chapter Lead at ANZ.
This article contains general information only – it does not take into account your personal needs, financial circumstances and objectives, it does not constitute any offer or inducement to acquire products and services, nor is it an endorsement of any products and services. Any opinions or views expressed in the article may not necessarily be the opinions or views of the ANZ Group, and to the maximum extent permitted by law, the ANZ Group makes no representation and gives no warranty as to the accuracy, currency or completeness of any information contained.