In our previous article, we discussed what we've learned through our use of a bespoke data product descriptor, and what we're looking for in a replacement.
In this article, we'll introduce the framework that we used to evaluate our options, run through the contenders, and arrive at our preferred solution.
How will we decide?
One of the main reasons ANZ looked to existing external standards was that we felt the evaluation process itself would help drive adoption across the enterprise.
To help start this conversation, and add some objectivity and rigour to the decision, we put forward some assessment criteria based on what we believed would be important to us and our specific mesh implementation needs at the time.
·      Adoption / community: is it adopted by more than one organisation? Are there multiple contributing organisations?
·      Maturity: how thoroughly documented is it? Does it show signs of evolution such as versioning? Does it leverage and reference existing standards where appropriate?
·      Semantics: how well does it address our known enterprise metadata capture requirements?
·      Lifecycle concerns: does it accommodate versioning, environment, and lifecycle status of a data product?
·      Observability: how does it support the description of observability measures? Will it generalise across platforms?
·      Data quality: does it provide guidance regarding the standardisation of data quality measurement?
·      Interoperability: would it work equally well outside of Google Cloud Platform?
·      Extensibility: would it allow us to patch missing functionality in a forward-compatible manner?
We knew we would need a community of practitioners across the bank to be part of the decision-making process, so we established that community, modelled lightly on industry standards bodies. It was made up of 34 people from across every division of the bank.
It is worth noting that our assessments, and our decisions, are made based on the community of practitioners’ knowledge of the products and opinion of how well they will meet the project’s - and ANZ’s - current needs (endorsement of the decision was given in August 2023). They are not a reflection or judgement on the quality and performance of the products themselves, or a recommendation or inducement to purchase.
Options considered
We considered the following open standards:
·      PayPal - Data Contract Template
·      Open Data Product Initiative - Open Data Product Specification
·      Agile Lab - Data Product Specification
·      Open Data Mesh - Data Product Descriptor Specification
PayPal - Data Contract Template
PayPal have open-sourced the data contract template used in their internal data mesh implementation. Provided as a markdown file with annotated YAML, the project has over 400 stars and no open or closed issues at the time of writing.
Design
The contract has a relatively flat structure, which appears somewhat tightly coupled to PayPal's own data mesh implementation. The most interesting parts of the contract for our purposes are:
·      Demographics: the main description of the product, its purpose and provenance. An interesting feature is the userConsumptionMode field, which the documentation notes will probably be replaced by output ports in future.
·      Dataset and schema: describes the dataset and its schema, including column-level metadata. Interesting features include column-level authoritativeDefinitions that link to business definitions and other documentation, as well as sample values and transformation logic.
·      Data quality: describes the data quality checks to be executed against the dataset, including a desired tool, rule and schedule at the table and column level.
·      Roles: a list of roles and approvers that will provide user access to the dataset.
·      Service level agreement: allows assertions such as latency, retention and frequency to be made based on the value of a particular column (e.g. a particular timestamp), together with the driver behind the SLA (e.g. regulatory or analytics).
The PayPal Data Contract Template brings with it some interesting ideas, illustrated in the sketch after this list:
·      exposing the logic that was used to define a particular column
·      providing a scaffolding for data quality tool configuration
·      consideration of how to support access requests, and
·      a mechanism to configure metric collection to support defined SLOs
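To make these ideas concrete, here's a minimal hand-written fragment in the spirit of the template. The field names reflect our reading of the template's documentation at the time of our assessment, so treat this as an illustration rather than an authoritative example.

```yaml
# Illustrative fragment in the style of the PayPal Data Contract Template.
# Field names follow the template's documentation as we read it at the time;
# verify against the current version before relying on them.
quantumName: Customer Transactions
version: 1.0.0
userConsumptionMode: Analytical       # flagged for future replacement by output ports
dataset:
  - table: transactions
    columns:
      - column: txn_amount
        logicalType: number
        sampleValues:
          - 10.50
          - 99.99
        transformLogic: SUM(raw.amount)   # exposes the column's derivation logic
        authoritativeDefinitions:
          - type: businessDefinition
            url: https://glossary.example.com/txn-amount   # hypothetical link
        quality:                          # scaffolding for data quality tooling
          - templateName: NullCheck       # hypothetical rule name
            toolName: greatExpectations
            scheduleCronExpression: "0 6 * * *"
roles:                                    # supports user access requests
  - role: txn_reader
    access: read
slaProperties:                            # SLA assertions tied to a column value
  - property: latency
    value: 4
    unit: h
    element: transactions.txn_timestamp
    driver: analytics                     # e.g. regulatory or analytics
```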
Open Data Product Initiative - Open Data Product Specification
The Open Data Product Initiative aims to provide an open-source community within which participants can contribute to a vendor-neutral, portable open data product specification. Their Open Data Product Specification is a machine-readable, vendor-neutral data product metadata model. It defines the objects and attributes, as well as the structure, of digital data products. The work is based on existing standards (schema.org), best practices and emerging concepts like Data Mesh. The project uses a fork-based approach to versioning, and the repository for its latest major version (2.0) has approximately 10 stars.
Design
The specification uses a shallow nested structure to represent its entities in a single YAML file. Some of these metadata entities are relatively minimalistic, with a more extensive set of optional values. The relevant sections of the specification are detailed below, followed by a short illustrative fragment:
·      Document level attributes: identification, version, intent and purpose.
·      Data Ops: defines deployment mechanism, scripts and tooling.
·      Data interface: a single mechanism for data access, such as an API or SQL, and the configuration required to use this interface.
·      Data SLA: the desired and promised quality of the product, including key metrics such as update frequency, and response time. Also contains contact details for support, logs, dashboards, etc.
·      Data quality: assertions regarding the quality of the product using standardised dimensions such as accuracy, completeness, consistency, timeliness, validity and uniqueness. Does not describe their current values, how they are calculated or how they may be observed.
·      Data holder: identifies who can grant access to the data.
·      Extensions: allows additional data to extend the specification at certain points using attributes prefixed with x-; the value can be null or any valid JSON value.
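The fragment below sketches what such a document might look like, trimmed to the sections discussed above. Some attribute names are simplified assumptions on our part rather than verbatim spec fields.

```yaml
# Illustrative ODPS-style document. The sections mirror those listed above;
# attribute names are indicative, so check the specification for exact keys.
product:
  name: Customer Transactions
  productID: cust-txn-001              # hypothetical identifier
  version: "1.0"
  status: production
  dataOps:                             # deployment mechanism, scripts, tooling
    infrastructure:
      platform: GCP
  dataAccess:                          # a single data interface, e.g. API or SQL
    interface: SQL
    connectionDetails: "jdbc:..."      # configuration needed to use it (elided)
  SLA:
    updateFrequency: daily             # promised quality of the product
    responseTime: 2s
    supportContact: data-team@example.com   # hypothetical contact details
  dataQuality:                         # asserted dimensions only; the spec does
    completeness: 99.5                 # not say how these are calculated or observed
    timeliness: 98.0
  dataHolder:
    legalName: Example Org Pty Ltd     # who can grant access to the data
  x-classification: internal           # extension attribute (x- prefix)
```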
Agile Lab - Data Product Specification
Agile Lab provide a data product specification based on the principles of technology independence, extensibility, and the data product as an independent unit of deployment. They see the specification as fundamental to creating services for automatic deployments and interoperable components. The project's GitHub repository has 48 stars, 7 contributors and a dozen or so issues, and uses the Apache-2.0 License.
Extension and modification
The specification also provides guidance around extension and modification; however, this is framed as contribution guidelines for the project repository rather than as an approach to adaptation and adoption. The guidance covers four types of changes:
·      Encouraged: customisation of specific sections and introduction of new metadata fields
·      Allowed: relaxation or restriction of mandatory fields
·      Discouraged: renaming, moving and deleting fields, and
·      Forbidden: changes to reserved fields (indicated with an * in the documentation)
Design
The specification is structured around a general information component and four technical traits: Output ports, Workloads, Storage areas and Observability. Fields within these components are strongly typed; however, the documentation consists of a single markdown file (there is no JSON Schema or website with integrated examples). A high-level overview of the document structure is provided below, followed by an indicative fragment:
·      General information: includes data product identification, intent and ownership details. Also supports lifecycle status, maturity, billing and tags. Uses a unique identifier with a namespace based on the data product's domain, name and major version. Some fields (such as SLA) are unstructured (optional strings) and ownership details are assumed to be coupled to LDAP groups. There is also an untyped section for specific information, which is recommended for infrastructure, configuration and other documents. While useful, without any typing or a requirement to declare the specification used, this may lead to diverging implementations.
·      Output ports: technical identifiers required for data access and a data contract that defines the schema (suggested to use Open Metadata specification) and SLA.
·      Workloads: describe the infrastructure and data pipelines that comprise the data product's application logic; these may also describe the data product's lineage via the input field of DataPipeline workloads.
·      Storage area: identifies where data is stored for the data product.
·      Observability: described for each output port, including an API endpoint and a set of assertions.
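The following indicative fragment shows the overall shape. The field names are our best reading of the markdown documentation and shouldn't be treated as authoritative.

```yaml
# Indicative fragment in the spirit of the Agile Lab specification; field
# names are our reading of its markdown documentation, not verbatim.
id: urn:dmb:dp:payments:cashflow:1        # namespace: domain, name, major version
name: Cashflow
domain: payments
version: 1.0.0
environment: production
dataProductOwner: group:cashflow-owners   # assumed to map to an LDAP group
informationSLA: "Refreshed daily by 06:00"  # unstructured optional string
maturity: Tactical
specific: {}                              # untyped section for infra/config details
outputPorts:
  - name: cashflow_sql
    outputPortType: SQL
    dataContract:
      schema: []                          # suggested to follow Open Metadata
      SLA: "Available 99.9% of the time"
workloads:
  - name: cashflow_pipeline
    workloadType: batch
    input:                                # lineage for DataPipeline workloads
      - urn:dmb:dp:payments:ledger:1
storage:
  - name: cashflow_lake
    technology: BigQuery
observability:                            # described per output port in the spec
  - endpoint: https://obs.example.com/cashflow  # hypothetical API endpoint
    assertions: []
```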
Open Data Mesh - Data Product Descriptor Specification
The Open Data Mesh Data Product Descriptor Specification is a declarative, technology-independent open standard that can describe a data product and its components via a JSON or YAML document. It has been developed by Quantyca, a technology consulting company specialising in data and metadata management. It is intended to be standard agnostic, composable through templating, flexible and extensible. The GitHub repository for the project has 51 stars and a dozen or so issues, but only three contributors.
Documentation
The specification is very thorough and is provided as a website with excellent documentation detailing the descriptor's key concepts, a quick-start tutorial, an extensively documented specification and corresponding JSON Schema.
Design
The design is intended to align with two specifications that are already popular in the software and data engineering communities: the OpenAPI and AsyncAPI initiatives. This alignment keeps the learning curve as smooth as possible. The design also leverages other standards, including:
·      RFC-3986 - relative referencing
·      RFC-4122 - universally unique identifiers (UUIDs)
·      RFC-6838 - media types
The descriptor is composable via nesting, and object types are defined to represent various aspects of the data product's metadata. It is also extremely extensible, providing a flexible combination of the following mechanisms, sketched after this list:
·      specification extension points (i.e. prefix new attributes with x-)
·      reference objects (allows referential reuse of objects both within and external to the document)
·      standard definition objects (which allow the user to formally describe an object using a specified standard specification)
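The fragment below sketches how the three mechanisms can combine on a single output port. The attribute names are indicative of the specification rather than exact, and the x- attribute and referenced file are hypothetical.

```yaml
# The three extensibility mechanisms together on one port (indicative only):
outputPorts:
  - name: transactions
    x-anz-classification: internal     # 1. extension point: x- prefixed attribute
    promises:
      api:                             # 3. standard definition object: describe
        specification: openapi         #    the port's API using a named standard
        version: 3.1.0
        definition:
          $ref: ./apis/transactions-api.yaml   # 2. reference object: reuse a
                                               #    document external to this one
```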
Other noteworthy aspects of the descriptor, illustrated in the skeleton after this list, include its use of:
·      Ports: the specification allows input, output and other port types to be described in terms of their intended or expected behaviour using a standard such as OpenAPI, AsyncAPI or another user-defined specification.
·      Namespaces: an opinionated namespace structure allows both the descriptor as a whole and its components to be uniquely identified via a uniform resource name (URN).
·      Contact points: describe the contact information for a product, with consideration given to supporting a range of channels such as email, Slack, Teams, etc.
·      Levels of abstraction: the specification provides a clear boundary between aspects of the descriptor that are intended to describe the product's boundaries and interactions with other systems (interface components), and the inner workings of the data product itself (internal components).
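Putting this together, a descriptor skeleton looks roughly like the following (heavily trimmed, with assumed example names; the specification's tutorial contains complete, validated examples):

```yaml
# Descriptor skeleton (trimmed): URN-based identification, multi-channel
# contact points, and the interface/internal component boundary.
dataProductDescriptor: 1.0.0
info:
  fullyQualifiedName: urn:org.example:dataproducts:transactions  # URN namespace
  domain: payments
  name: transactions
  version: 1.0.0
  owner:
    id: jane.citizen@example.com       # hypothetical owner
    name: Jane Citizen
  contactPoints:
    - name: Support
      channel: slack                   # email, Slack, Teams, ...
      address: "#transactions-support"
interfaceComponents:                   # the product's boundary and interactions
  inputPorts: []
  outputPorts: []                      # see the port fragment above
  observabilityPorts: []
internalComponents:                    # the product's inner workings
  applicationComponents: []
  infrastructuralComponents: []
```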
Decision
Through this evaluation we discovered several viable alternatives to our existing bespoke data product descriptor implementation.
Any one of the alternatives would represent an incremental improvement over our current implementation; however, on balance we feel that the Open Data Mesh Data Product Descriptor Specification represents the best choice for ANZ right now.
In a future article, we'll go deeper into the Open Data Mesh Data Product Descriptor Specification and explain how we intend to leverage it at ANZ to create our self-service data mesh platform.
Other resources
The data mesh governance by example repository contains curated examples of data mesh values, operating models and global policies, along with example specifications and architectures that may be of use in any data mesh implementation.
The Medium article ‘Data contracts ensure robustness in your data mesh architecture’ explains how data contracts offer transparency into data ownership, usage and dependencies, and how they address technical challenges by ensuring interface compatibility and versioning in a modern data mesh framework.
Jethro is a Staff Data Engineer at ANZ. He is a hands-on solution architect with 10+ years of experience in Data, Software Engineering, and DevOps. He is a Python and PySpark expert with proficiency in tools like dbt, Databricks, React, TypeScript, and Kubernetes. He is passionate about cloud technologies, CI/CD, GitOps and test-driven development, and believes in keeping things simple, shipping early, and iterating based on feedback.
Adric is Data Engineer Chapter Lead at ANZ.
This article contains general information only – it does not take into account your personal needs, financial circumstances and objectives; it does not constitute any offer or inducement to acquire products or services, nor is it an endorsement of any products or services. Any opinions or views expressed in the article may not necessarily be the opinions or views of the ANZ Group and, to the maximum extent permitted by law, the ANZ Group makes no representation and gives no warranty as to the accuracy, currency or completeness of any information contained in it.